Generalization in quantum machine learning from few training data

Modern quantum machine learning (QML) methods involve variationally optimizing a parameterized quantum circuit on a training data set, and subsequently making predictions on a testing data set (i.e., generalizing). In this work, we provide a comprehensive study of generalization performance in QML after training on a limited number $N$ of training data points. We show that the generalization error of a quantum machine learning model with $T$ trainable gates scales at worst as $\sqrt{T/N}$. When only $K \ll T$ gates have undergone substantial change in the optimization process, we prove that the generalization error improves to $\sqrt{K/N}$. Our results imply that the compiling of unitaries into a polynomial number of native gates, a crucial application for the quantum computing industry that typically uses exponential-size training data, can be sped up significantly. We also show that classification of quantum states across a phase transition with a quantum convolutional neural network requires only a very small training data set. Other potential applications include learning quantum error correcting codes or quantum dynamical simulation. Our work injects new hope into the field of QML, as good generalization is guaranteed from few training data.

complexity bounds lead to generalization bounds for QMLMs that depend on spectral properties (more precisely, rank or Frobenius norm) of the parameterized measurement.
Shortly thereafter, Ref. [22] studied the generalization capabilities of QMLMs with a focus on the strategy used to encode classical data into the quantum circuit. In particular, they considered data encodings via Hamiltonian evolutions, where data re-uploading is allowed. For corresponding QMLMs, Ref. [22] established generalization bounds that depend explicitly on properties of the Hamiltonians used for data-encoding. These results are complementary to our work: The generalization guarantees of Ref. [22] depend only on the encoding strategy used in the QMLM, whereas our results are formulated in terms of properties of the trainable part of the QMLM only.
Ref. [23] investigated the expressibility and the generalization behavior of specific QMLMs. By combining light cone arguments with insights into how a specific data-encoding leads to effective dimensionality limitations (see also [22]), Ref. [23] obtained VC-dimension bounds for the hardware efficient ansatz. These bounds depend on the number of qubits and on the number of trainable layers. Ref. [23] interpreted the overall limitation on the VC-dimension imposed by the data-encoding as an automatic regularization, which is helpful in avoiding overfitting.
Lastly, Ref. [24] investigated a problem of learning parametrized unitary quantum circuits from training data consisting of pairs of input and corresponding output states. They established generalization bounds, and thus sample complexity bounds, by first identifying a universal family of variational quantum circuit architectures, then considering a finite discretization of this family, and finally applying a standard generalization bound for finite hypothesis classes. We note that the generalization guarantee obtained from Theorem 6 is tighter than that obtained in Ref. [24]: For a variational $n$-qubit QMLM with at most $n^c$ gates, [24, Theorem 2] implies that a sample complexity of $\tilde{O}(n^{c+1})$ suffices for good generalization, whereas Theorem 6 tells us that already $\tilde{O}(n^c)$ samples suffice. Additionally, our generalization guarantees apply for more general architectures than those considered in [24].

Related Work on Quantum Phase Recognition
Recognizing quantum phases of matter is an important question in physics. Recently, many works have considered training machine learning models to classify quantum phases. The works include the use of quantum neural networks [25] and classical machine learning models [26][27][28][29]. Most of the existing works do not come with rigorous guarantees. Thus, it is not clear whether the respective machine learning models will predict well after training. Our work shows that when a quantum neural network, such as a QCNN [25], can perform well on a training set with a moderate number of examples, the quantum neural network will also predict well on new data. This is particularly prominent in QCNNs, for which the required training data size scales at most polylogarithmically in the system size. However, in order for quantum neural networks to achieve a small training error, one still needs to address various challenges, such as barren plateaus in the training landscape [30,31].
Recently, [32] has proposed provably efficient classical machine learning models that can classify a wide range of quantum phases of matter, including symmetry-broken phases, topological phases, and symmetry-protected topological phases. These classical machine learning models are efficient in both computational time and the required training data [32]. Furthermore, the numerical experiments of [32] have shown that no labels of the different phases are needed to train the classical machine learning models. The classical algorithm can automatically uncover the quantum phases of matter in an unsupervised learning procedure.
It remains to be seen if QMLMs, such as QCNNs [25], can improve upon classical machine learning models in classifying quantum phases of matter. For example, [32] shows that the prediction performance of classical machine learning models sometimes degrades when the correlation length in the ground state wave function is high. It would be interesting to understand whether QMLMs can still work well when classical machine learning models fail.

Related Work on Quantum Compiling
Compiling of quantum circuits is a broad field with many distinct approaches. For example, temporal planning [33,34], reinforcement learning [35], and supervised learning [36,37] are three alternative approaches that have been applied to quantum compiling. Moreover, while classical methods for quantum compiling are the most common, it has also been proposed to do quantum-assisted quantum compiling where a quantum computer is involved in the compiling process [38][39][40][41].
While not all methods employ training data, it is worth noting that some state-of-the-art methods are in fact based on training data [36,37,42]. It is also worth remarking that noise-aware quantum compiling methods can involve training data [37]. For these methods, it has largely been assumed that one would need an amount of training data that grows exponentially with the number of qubits. Naturally, this exponential scaling places a cutoff on the size of unitaries that one can compile. However, with our results in mind (allowing for only polynomial-sized training sets), this cutoff can be significantly extended to larger unitary sizes.
For quantum compiling, the benefit of our work is two-fold, in that both classical methods and quantum methods can potentially be sped up. Classical methods for quantum compiling are currently being used in the quantum computing industry to enhance the performance of cloud-based quantum computing (e.g., by companies such as Rigetti and IBM). Therefore, speeding up classical methods for quantum compiling can potentially have a direct impact on cloud-based quantum computing. Both standard compiling and noise-aware compiling are important for industrial near-term quantum computing, and our work impacts both of these approaches.
In addition, quantum-assisted methods for quantum compiling can also reduce their resource costs based on our results. Variational quantum algorithms for quantum compiling have been introduced [38][39][40][41]. Specifically, Refs. [38,39,43] discussed methods that employ an entangled state on 2n qubits to compile an n-qubit unitary. Due to our work, this entangled state can apparently be reduced in size, namely only needing a Schmidt rank that is polynomial in n (instead of a Schmidt rank that is exponential in n). Ref. [40] proposed a slightly different approach that did not involve an auxiliary system, but simply used multiple training data points. Our work shows that the amount of training data here does not need to grow exponentially in n, making the approach in Ref. [40] potentially scalable.
Finally, we note that variable ansatz methods (e.g., Refs. [44,45]) for quantum compiling are a state-of-the-art approach that is employed, e.g., in Refs. [36,37]. As noted in the main text, our results are general enough to cover the variable ansatz case (where the structure of the circuit changes during the optimization). Hence we provide guidance for how much training data is needed for the variable ansatz case as well.

Supplementary Note 2. Auxiliary Lemmata
Before presenting our results, we use this section to recall some well known auxiliary results that enter our proofs.

Auxiliary Lemmata from Statistical Learning Theory
We use two standard concentration inequalities. The first is due to Wassily Hoeffding.
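Lemma 1 (Hoeffding's Concentration Inequality [46]). Let $X_1, \ldots, X_N$ be independent $\mathbb{R}$-valued random variables. Assume that, for every $1 \le i \le N$, $a_i \le X_i \le b_i$ almost surely. Then, for every $\varepsilon > 0$, we have
$$\mathbb{P}\left[\sum_{i=1}^N \left(X_i - \mathbb{E}[X_i]\right) \ge \varepsilon\right] \le \exp\left(-\frac{2\varepsilon^2}{\sum_{i=1}^N (b_i - a_i)^2}\right).$$
The second is the bounded differences inequality, originally due to Colin McDiarmid.
Lemma 2 (McDiarmid's Concentration Inequality [47]). Let $X_1, \ldots, X_N$ be independent random variables, each with values in $\mathcal{Z}$. Let $\varphi : \mathcal{Z}^N \to \mathbb{R}$ be a measurable function s.t., whenever $z \in \mathcal{Z}^N$ and $z' \in \mathcal{Z}^N$ differ only in the $i$th entry, then $|\varphi(z) - \varphi(z')| \le b_i$. Then, for every $\varepsilon > 0$, we have
$$\mathbb{P}\left[\varphi(X_1, \ldots, X_N) - \mathbb{E}[\varphi(X_1, \ldots, X_N)] \ge \varepsilon\right] \le \exp\left(-\frac{2\varepsilon^2}{\sum_{i=1}^N b_i^2}\right).$$
The third well known ingredient that we will employ in our reasoning without giving a proof is the following.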

Auxiliary Lemmata from Quantum Information Theory
From quantum information theory, we crucially make use of the following lemma.
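Lemma 4 (Subadditivity of diamond distance; see [49], Proposition 3.48). For any completely positive and trace-preserving maps $\mathcal{A}, \mathcal{B}, \mathcal{C}, \mathcal{D}$, where $\mathcal{B}$ and $\mathcal{D}$ map from $n$-qubit to $m$-qubit systems and $\mathcal{A}$ and $\mathcal{C}$ map from $m$-qubit to $k$-qubit systems, we have the following inequality:
$$\|\mathcal{A}\mathcal{B} - \mathcal{C}\mathcal{D}\|_\diamond \le \|\mathcal{A} - \mathcal{C}\|_\diamond + \|\mathcal{B} - \mathcal{D}\|_\diamond.$$
Also, to translate between the spectral norm of unitaries and the diamond norm of the corresponding channels, we employ the following result.
Lemma 5 (Spectral norm and diamond norm of unitary channels). Let $\mathcal{U}(\rho) = U\rho U^\dagger$ and $\mathcal{V}(\rho) = V\rho V^\dagger$ be unitary channels. Then, $\frac{1}{2}\|\mathcal{U}(|\psi\rangle\langle\psi|) - \mathcal{V}(|\psi\rangle\langle\psi|)\|_1 \le \|(U - V)|\psi\rangle\|_2$ for any pure state $|\psi\rangle$. Therefore,
$$\frac{1}{2}\|\mathcal{U} - \mathcal{V}\|_\diamond \le \|U - V\|.$$
Proof. The proof is adapted from [50]. Fix an input $|\psi\rangle$ and denote the output state vectors by $|u\rangle = U|\psi\rangle$ and $|v\rangle = V|\psi\rangle$, respectively. Normalization ensures that these state vectors obey $|\langle u, v\rangle| \le 1$, as well as $\||u\rangle - |v\rangle\|_2^2 = 2(1 - \mathrm{Re}(\langle u, v\rangle))$. Apply the Fuchs-van de Graaf relations [51] to convert the output trace distance into a (pure) output fidelity:
$$\frac{1}{2}\big\||u\rangle\langle u| - |v\rangle\langle v|\big\|_1 = \sqrt{1 - |\langle u, v\rangle|^2} \le \sqrt{2(1 - |\langle u, v\rangle|)} \le \sqrt{2(1 - \mathrm{Re}(\langle u, v\rangle))} = \||u\rangle - |v\rangle\|_2.$$
The diamond distance bound then is a direct consequence of this relation. Using the fact that stabilization is not necessary for computing the diamond distance of two unitary channels [49], we get
$$\frac{1}{2}\|\mathcal{U} - \mathcal{V}\|_\diamond = \max_{|\psi\rangle} \frac{1}{2}\|\mathcal{U}(|\psi\rangle\langle\psi|) - \mathcal{V}(|\psi\rangle\langle\psi|)\|_1 \le \max_{|\psi\rangle} \|(U - V)|\psi\rangle\|_2 = \|U - V\|.$$
Here, we have also used the definition of the operator norm.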

Supplementary Note 3. Analytical Results: Details and Proofs
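We first introduce some standard notation. Let $\mathcal{D}(\mathcal{H})$ denote the set of density operators (positive semi-definite with unit trace) acting on the Hilbert space $\mathcal{H}$. Let $\mathcal{L}(\mathcal{H})$ denote the space of square linear operators acting on $\mathcal{H}$. Let $\mathcal{L}(\mathcal{H}, \mathcal{H}')$ denote the set of linear operators taking $\mathcal{H}$ to a Hilbert space $\mathcal{H}'$. The trace norm of a linear operator $A \in \mathcal{L}(\mathcal{H}, \mathcal{H}')$ is defined as
$$\|A\|_1 := \mathrm{Tr}\left[\sqrt{A^\dagger A}\right].$$
The trace distance between any two operators $A, B \in \mathcal{L}(\mathcal{H}, \mathcal{H}')$ is $\|A - B\|_1$, and for two quantum states $\rho, \sigma \in \mathcal{D}(\mathcal{H})$ it is linearly related to the maximum success probability of distinguishing $\rho$ and $\sigma$ in a quantum hypothesis testing experiment. A linear map $\mathcal{N}_{A\to B} : \mathcal{L}(\mathcal{H}_A) \to \mathcal{L}(\mathcal{H}_B)$ is completely positive (CP) if $(\mathcal{I}_R \otimes \mathcal{N}_{A\to B})(X_{RA})$ is positive semi-definite for every positive semi-definite $X_{RA} \in \mathcal{L}(\mathcal{H}_{RA})$, where $\mathcal{H}_{RA} = \mathcal{H}_R \otimes \mathcal{H}_A$ and the reference system $R$ can be of arbitrary size. Moreover, a linear map $\mathcal{N}_{A\to B} : \mathcal{L}(\mathcal{H}_A) \to \mathcal{L}(\mathcal{H}_B)$ is trace preserving (TP) if $\mathrm{Tr}(\mathcal{N}_{A\to B}(X_A)) = \mathrm{Tr}(X_A)$ for all $X_A \in \mathcal{L}(\mathcal{H}_A)$. A linear map $\mathcal{N}_{A\to B}$ is called a quantum channel if it is completely positive and trace preserving (CPTP). Let $\mathcal{N}_{A\to B}$ and $\mathcal{M}_{A\to B}$ denote quantum channels. Then the diamond distance between $\mathcal{N}_{A\to B}$ and $\mathcal{M}_{A\to B}$ is defined as
$$\|\mathcal{N}_{A\to B} - \mathcal{M}_{A\to B}\|_\diamond := \max_{\rho_{RA} \in \mathcal{D}(\mathcal{H}_{RA})} \big\|(\mathcal{I}_R \otimes \mathcal{N}_{A\to B})(\rho_{RA}) - (\mathcal{I}_R \otimes \mathcal{M}_{A\to B})(\rho_{RA})\big\|_1, \tag{13}$$
where $\mathcal{I}_R$ is the identity map acting on $\mathcal{H}_R$. As a consequence of the convexity of the trace norm and the Schmidt decomposition theorem, it suffices to optimize Eq. (13) over pure states in $\mathcal{H}_{RA}$ with $\dim(\mathcal{H}_R) = \dim(\mathcal{H}_A)$.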
With the notation in place, we now present our analytical results. Generalization performance depends crucially on the metric entropy, which characterizes both classical and quantum machine learning models [52]. Metric entropy is a measure of complexity or expressiveness for a set of objects endowed with a distance metric.
In Supplementary Note 3.1, we take the diamond norm as the distance metric and prove metric entropy bounds for two sets of interest. First, we examine the set $\mathcal{U}_A$ of all unitaries that can be represented using a (fixed) variational quantum circuit $A$ with $T$ parameterized 2-qubit unitary gates. More precisely, we consider the corresponding set of unitary channels. Second, we study the set $\mathrm{CPTP}_A$ of all CPTP maps that can be represented using a (fixed) variational quantum circuit $A$ with $T$ parameterized 2-qubit CPTP maps. The latter scenario generalizes the former and corresponds to the difference between perfect and noisy implementations. Note that, in both cases, the variational quantum circuit itself could contain more than $T$ gates. However, these additional gates would have to be fixed and not trainable.
Using these metric entropy bounds and variants thereof, we establish prediction error bounds for variational quantum machine learning models (QMLMs) in terms of the number of trainable elements in Supplementary Note 3.2. We consider different scenarios of interest, among them that of using multiple copies of a quantum neural network (such that parameters are reused over different copies), as well as both fixed and variable circuit architectures.

Covering Number Bounds for Variational Quantum Circuits
In this section, we provide bounds on the expressivity of the class of CPTP maps (or unitaries) that a quantum machine learning model (QMLM) can implement in terms of the number of trainable elements used in the architecture. As a measure of expressivity, we choose covering numbers and metric entropies w.r.t. (the metric induced by) the diamond norm. We first recall the general definition of covering numbers and metric entropies.
Definition 1 (Covering nets, covering numbers, and metric entropies). Let $(X, d)$ be a metric space. Let $K \subset X$ be a subset and let $\varepsilon > 0$.
• A set $N \subseteq K$ is an $\varepsilon$-covering net of $K$ if and only if $K$ can be covered by $\varepsilon$-balls around the points in $N$.
• The covering number $N(K, d, \varepsilon)$ is the smallest possible cardinality of an $\varepsilon$-covering net of $K$.
• The metric entropy $\log_2 N(K, d, \varepsilon)$ is the logarithm of the covering number.
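To make Definition 1 concrete, the following minimal sketch builds an $\varepsilon$-covering net greedily for a finite point set in the plane with the Euclidean metric. The point set (a grid on the unit square) and the function name `greedy_epsilon_net` are assumptions made only for illustration. The size of the returned net is an upper bound on the covering number, not necessarily the covering number itself, since the greedy choice need not be optimal.

```python
import math

def greedy_epsilon_net(points, eps):
    """Return a subset of `points` such that every point lies within
    Euclidean distance eps of at least one element of the subset."""
    net = []
    for p in points:
        # Keep p only if no previously chosen net point already covers it.
        if all(math.dist(p, q) > eps for q in net):
            net.append(p)
    return net

# Illustrative set K: an 11 x 11 grid on the unit square.
grid = [(i / 10, j / 10) for i in range(11) for j in range(11)]

for eps in (0.5, 0.25, 0.1):
    net = greedy_epsilon_net(grid, eps)
    # log2 of the net size upper-bounds the metric entropy at scale eps.
    print(f"eps={eps}: net size {len(net)}, log2 = {math.log2(len(net)):.2f}")
```

Every point that is skipped by the loop is, by construction, within distance $\varepsilon$ of a point already in the net, which is exactly the covering property required in the definition.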
In finite-dimensional real spaces, the covering numbers of norm balls, and thereby of norm-bounded sets, can be bounded easily. We make use of this observation to provide basic covering number bounds for the classes of 2-qubit unitaries and 2-qubit CPTP maps. We first state the bound for the unitary case.
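Lemma 6 (Covering number bounds for 2-qubit unitaries). Let $\|\cdot\|$ be a unitarily invariant norm on complex $4 \times 4$ matrices. The covering number of the set of 2-qubit unitaries $\mathcal{U}(\mathbb{C}^2 \otimes \mathbb{C}^2)$ w.r.t. the norm $\|\cdot\|$ can be bounded as
$$N\big(\mathcal{U}(\mathbb{C}^2 \otimes \mathbb{C}^2), \|\cdot\|, \varepsilon\big) \le \left(\frac{6\,\|\mathbb{1}_{\mathbb{C}^4}\|}{\varepsilon}\right)^{32} \quad \text{for } 0 < \varepsilon \le \|\mathbb{1}_{\mathbb{C}^4}\|.$$
Proof. It is well known (see, e.g., Section 4.2 in [53]) that the covering numbers of a norm-ball of radius $R > 0$ around some point $x \in \mathbb{R}^K$, for $0 < \varepsilon \le R$, can be bounded as
$$N\big(B_R(x), \|\cdot\|, \varepsilon\big) \le \left(\frac{3R}{\varepsilon}\right)^K,$$
where the ball and the coverings are taken w.r.t. the same norm. In our scenario, we can apply this as follows: As $\|\cdot\|$ is assumed to be unitarily invariant, we have $\|U\| = \|\mathbb{1}_{\mathbb{C}^4}\|$ for every unitary $U \in \mathcal{U}(\mathbb{C}^2 \otimes \mathbb{C}^2)$. In particular, we have, for $R := \|\mathbb{1}_{\mathbb{C}^4}\|$, $\mathcal{U}(\mathbb{C}^2 \otimes \mathbb{C}^2) \subset B_R(0)$, where $B_R(0)$ is the ball of matrices with $4 \times 4 = 16$ complex entries (i.e., $K = 32$ real dimensions) around the 0-matrix, taken w.r.t. $\|\cdot\|$. Therefore, we have
$$N\big(\mathcal{U}(\mathbb{C}^2 \otimes \mathbb{C}^2), \|\cdot\|, \varepsilon\big) \le N\big(B_R(0), \|\cdot\|, \varepsilon/2\big) \le \left(\frac{6R}{\varepsilon}\right)^{32},$$
where the first step uses the approximate monotonicity of (interior) covering numbers (see, e.g., Section 4.2 in [53]).
This covering number bound becomes particularly useful for the spectral norm, for which $\|\mathbb{1}_{\mathbb{C}^4}\| = 1$.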
With an analogous reasoning, we can prove a covering number bound for 2-qubit CPTP maps.
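Lemma 7 (Covering number bounds for 2-qubit CPTP maps). The covering number of the set of 2-qubit CPTP maps $\mathrm{CPTP}(\mathbb{C}^2 \otimes \mathbb{C}^2)$ w.r.t. the diamond distance can be bounded as
$$N\big(\mathrm{CPTP}(\mathbb{C}^2 \otimes \mathbb{C}^2), \|\cdot\|_\diamond, \varepsilon\big) \le \left(\frac{6}{\varepsilon}\right)^{512} \quad \text{for } 0 < \varepsilon \le 1.$$
Proof. As CPTP maps have diamond norm equal to 1, this follows (analogously to the previous Lemma) by upper-bounding the covering number of the diamond-norm unit ball, which lives in a $(2^4 \times 2^4)$-dimensional space over the complex numbers, i.e., in $2 \cdot 2^8 = 512$ real dimensions. The latter can be achieved as in the previous Lemma.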
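We combine these basic upper bounds for single trainable elements with sub-additivity of the diamond norm (Lemma 4) to obtain a covering number bound for the class of maps that can be implemented by a variational QMLM, understood as a parametrized CPTP map as described in the main text. Again, we first state the bound for the unitary case.
Theorem 1 (Metric entropy bounds for QMLMs of unitaries). Let $U^{\mathrm{QMLM}}_\theta$ be an $n$-qubit QMLM with $T$ parameterized 2-qubit unitaries and an arbitrary number of non-trainable, global unitaries. Let $\mathcal{U}_{\mathrm{QMLM}} \subset \mathcal{U}((\mathbb{C}^2)^{\otimes n})$ denote the set of $n$-qubit unitaries that can be implemented by the QMLM. Then, for every $\varepsilon \in (0, 1]$, there exists an $\varepsilon$-covering net $N_\varepsilon$ of (the set of unitary channels corresponding to) $\mathcal{U}_{\mathrm{QMLM}}$ w.r.t. the diamond distance such that the logarithm of its size can be upper bounded as
$$\log(|N_\varepsilon|) \le 32\, T \log\left(\frac{12T}{\varepsilon}\right).$$
Proof. Let $\varepsilon \in (0, 1]$, write $\tilde{\varepsilon} := \frac{\varepsilon}{2T}$. By Lemma 6, there exists an $\tilde{\varepsilon}$-net $\tilde{N}_{\tilde{\varepsilon}}$ of $\mathcal{U}(\mathbb{C}^2 \otimes \mathbb{C}^2)$ w.r.t. the spectral norm of size $|\tilde{N}_{\tilde{\varepsilon}}| \le (6/\tilde{\varepsilon})^{32}$. Every unitary that the QMLM can implement is a product of factors $U_t$ and $V_s$, where $U_t$, $1 \le t \le T$, are a particular choice of the trainable 2-qubit unitaries and $V_s$, $0 \le s \le T + 1$, are the non-trainable $n$-qubit unitaries occurring in the QMLM. (For ease of readability, we do not write out the tensor factors of identities accompanying the $U_t$.) We now consider the set $N_\varepsilon$ of unitaries obtained by plugging the elements of $\tilde{N}_{\tilde{\varepsilon}}$ as trainable 2-qubit unitaries into the QMLM.
Let $U \in \mathcal{U}_{\mathrm{QMLM}}$ be an arbitrary $n$-qubit unitary that can be implemented by the QMLM, i.e., $U$ is such a product for some choice of $U_1, \ldots, U_T$. Let $\mathcal{U}$ denote the corresponding unitary channel. As $\tilde{N}_{\tilde{\varepsilon}}$ is an $\tilde{\varepsilon}$-net for the set of 2-qubit unitaries, we can find $\tilde{U}_t \in \tilde{N}_{\tilde{\varepsilon}}$, $1 \le t \le T$, such that $\|U_t - \tilde{U}_t\| \le \tilde{\varepsilon}$ for all $1 \le t \le T$. Then, the unitary channel $\tilde{\mathcal{U}}$ described by the unitary $\tilde{U} \in N_\varepsilon$ that is obtained from $U$ by replacing each $U_t$ with $\tilde{U}_t$ satisfies
$$\|\mathcal{U} - \tilde{\mathcal{U}}\|_\diamond \le \sum_{t=1}^T \|\mathcal{U}_t - \tilde{\mathcal{U}}_t\|_\diamond \le 2 \sum_{t=1}^T \|U_t - \tilde{U}_t\| \le 2T\tilde{\varepsilon} = \varepsilon,$$
where we iteratively applied sub-additivity of the diamond distance (Lemma 4) in the first step, then used the relation between the diamond distance of unitary channels and the spectral norm distance of the corresponding unitaries (Lemma 5), and in the final step plugged in the definition of $\tilde{\varepsilon}$. Thus, we have shown that the set of unitary channels with unitaries in $N_\varepsilon$ is an $\varepsilon$-covering net of the set of unitary channels with unitaries in $\mathcal{U}_{\mathrm{QMLM}}$ w.r.t. the diamond distance.
As $|N_\varepsilon| = |\tilde{N}_{\tilde{\varepsilon}}|^T$ (by definition of $N_\varepsilon$), plugging in the bound on the size of $\tilde{N}_{\tilde{\varepsilon}}$ then gives the desired bound on the cardinality of $N_\varepsilon$ and thereby of our covering net.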
For variational quantum circuits consisting of CPTP maps, we obtain an analogous result upon replacing Lemma 6 by Lemma 7 in the previous proof:
Theorem 2 (Metric entropy bounds for QMLMs of CPTP maps). Let $\mathcal{E}^{\mathrm{QMLM}}_\theta(\cdot)$ be an $n$-qubit QMLM with $T$ parameterized 2-qubit CPTP maps and an arbitrary number of non-trainable, global CPTP maps. Let $\mathrm{CPTP}_{\mathrm{QMLM}} \subset \mathrm{CPTP}((\mathbb{C}^2)^{\otimes n})$ denote the set of $n$-qubit CPTP maps that can be implemented by the QMLM $\mathcal{E}^{\mathrm{QMLM}}_\theta(\cdot)$.
For any $\varepsilon \in (0, 1]$, there exists an $\varepsilon$-covering net $N_\varepsilon$ of $\mathrm{CPTP}_{\mathrm{QMLM}}$ w.r.t. the diamond distance such that the logarithm of its size can be upper bounded as
$$\log(|N_\varepsilon|) \le 512\, T \log\left(\frac{6T}{\varepsilon}\right).$$
In both scenarios, the metric entropy grows at worst slightly super-linearly with the number of parameterized (and thus trainable) operations.
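We also provide a generalization of these metric entropy bounds that is natural for the scenario in which trainable gates are reused in the quantum machine learning model:
Theorem 3 (Metric entropy bounds for QMLMs of reused CPTP maps). Let $\mathcal{E}^{\mathrm{QMLM}}_\theta(\cdot)$ be an $n$-qubit QMLM with $T$ parameterized 2-qubit CPTP maps, in which the $t$th of these maps is used $M_t$ times, and an arbitrary number of non-trainable, global CPTP maps. Let $\mathrm{CPTP}_{\mathrm{QMLM}} \subset \mathrm{CPTP}((\mathbb{C}^2)^{\otimes n})$ denote the set of $n$-qubit CPTP maps that can be implemented by the QMLM $\mathcal{E}^{\mathrm{QMLM}}_\theta(\cdot)$. For any $\varepsilon \in (0, 1]$, there exists an $\varepsilon$-covering net $N_\varepsilon$ of $\mathrm{CPTP}_{\mathrm{QMLM}}$ w.r.t. the diamond distance such that the logarithm of its size can be upper bounded as
$$\log(|N_\varepsilon|) \le 512 \sum_{t=1}^T \log\left(\frac{6\, T M_t}{\varepsilon}\right).$$
Proof. We can use the same reasoning as in the proof of Theorems 1 and 2 to show that we can define an $\varepsilon$-covering net $N_\varepsilon$ for $\mathrm{CPTP}_{\mathrm{QMLM}}$ (w.r.t. $\|\cdot\|_\diamond$) by plugging the elements of an $\tilde{\varepsilon}_t$-net for $\mathrm{CPTP}(\mathbb{C}^2 \otimes \mathbb{C}^2)$ into the positions of the QMLM corresponding to the $t$th independently trainable 2-qubit map, where $\tilde{\varepsilon}_t := \frac{\varepsilon}{T \cdot M_t}$. When picking the $\tilde{\varepsilon}_t$-nets with cardinality bounded as in Lemma 7, the cardinality of $N_\varepsilon$ can be bounded as
$$|N_\varepsilon| \le \prod_{t=1}^T \left(\frac{6}{\tilde{\varepsilon}_t}\right)^{512} = \prod_{t=1}^T \left(\frac{6\, T M_t}{\varepsilon}\right)^{512}.$$
Taking a logarithm gives the claimed metric entropy bound.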
The growth of the metric entropies in terms of T , the number of independently trainable maps, is still at most slightly super-linear. But the growth in terms of the numbers of times that the trainable maps are reused is only logarithmic.
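The scaling just described can be explored numerically. The sketch below evaluates closed-form metric entropy bounds under stated assumptions: the CPTP form $512\,T\log(6T/\varepsilon)$ is the one quoted in the proof of Theorem 5, while the unitary form $32\,T\log(12T/\varepsilon)$ and the reused-map form $512\sum_t \log(6TM_t/\varepsilon)$ are obtained here by combining the net sizes of Lemmas 6 and 7 with $\tilde{\varepsilon} = \varepsilon/(2T)$ and $\tilde{\varepsilon}_t = \varepsilon/(T M_t)$. The function names are hypothetical.

```python
import math

def metric_entropy_unitary(T, eps):
    """Assumed form 32 * T * log(12 T / eps) for T trainable 2-qubit unitaries."""
    return 32 * T * math.log(12 * T / eps)

def metric_entropy_cptp(T, eps):
    """512 * T * log(6 T / eps) for T trainable 2-qubit CPTP maps."""
    return 512 * T * math.log(6 * T / eps)

def metric_entropy_cptp_reused(uses, eps):
    """Assumed form 512 * sum_t log(6 T M_t / eps), where uses[t] = M_t."""
    T = len(uses)
    return 512 * sum(math.log(6 * T * m / eps) for m in uses)

eps = 0.1
for T in (1, 10, 100):
    print(T, metric_entropy_unitary(T, eps), metric_entropy_cptp(T, eps))

# Reusing one of three maps 100 times only adds 512 * log(100) to the bound.
extra = metric_entropy_cptp_reused([100, 1, 1], eps) - metric_entropy_cptp(3, eps)
print(extra, 512 * math.log(100))
```

With all $M_t = 1$ the reused-map expression reduces to the CPTP expression, and increasing a single $M_t$ changes the bound only through $\log M_t$, in line with the remark on logarithmic growth.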
Note that we have formulated the metric entropy bounds for the qubit case only, but they can naturally be extended to the qudit case. Then the upper bound will depend polynomially on the dimension d.
We provide one more metric entropy bound for QMLMs, which also takes the training procedure into account, in Theorem 8. Formulating this bound, however, requires us to fix some (notational) assumptions on the optimization procedure used for training. Therefore, we postpone this final metric entropy bound to Supplementary Note 3.2.5.
Remark 1. Both in this section and in the following ones, we formulate our results for QMLMs whose parametrized gates act on (at most) 2 qubits. Our proofs and results straightforwardly extend to the case in which the parametrized gates act on (at most) $\kappa$ qubits. In particular, when going from 2- to $\kappa$-local, the $T$-dependence remains the same. Only the constant prefactors in the metric entropy bounds (and thus the generalization bounds) change, namely from $2 \cdot 2^4$ to $2 \cdot 2^{2\kappa}$ in the unitary case, and from $2 \cdot 2^8$ to $2 \cdot 2^{4\kappa}$ in the CPTP case. Since $\kappa$ is constant, this is just a prefactor that does not change the scaling of our theorems.
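As a worked illustration of Remark 1, the prefactors count the real dimension of the ambient matrix space: a $\kappa$-qubit unitary has $2^\kappa \times 2^\kappa$ complex entries, and a $\kappa$-qubit CPTP map lives in a $(2^{2\kappa} \times 2^{2\kappa})$-dimensional complex space. The value $\kappa = 3$ below is chosen only as an example.

```latex
\begin{align*}
\text{unitary case:} \quad & 2 \cdot 2^{2\kappa}\Big|_{\kappa = 2} = 2 \cdot 2^{4} = 32,
  & 2 \cdot 2^{2\kappa}\Big|_{\kappa = 3} &= 2 \cdot 2^{6} = 128, \\
\text{CPTP case:} \quad & 2 \cdot 2^{4\kappa}\Big|_{\kappa = 2} = 2 \cdot 2^{8} = 512,
  & 2 \cdot 2^{4\kappa}\Big|_{\kappa = 3} &= 2 \cdot 2^{12} = 8192.
\end{align*}
```

The prefactor grows exponentially in $\kappa$, which is why the locality $\kappa$ has to be treated as a constant for the $T$-scaling to be the relevant one.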

Prediction error bounds for quantum machine learning models
Using well-established tools from statistical learning theory, we can derive prediction error bounds for QMLMs from the covering number bounds established in Supplementary Note 3.1. Before doing so, we describe our setting in detail.
During the training process, we optimize the parameters $\alpha$ in the (CPTP map implemented by the) quantum machine learning model $\mathcal{E}^{\mathrm{QMLM}}_\alpha(\cdot)$ according to some criteria and depending on the training data. Here, we write $\alpha = (\theta, k)$ if we consider both discrete, structural parameters $k$ and continuous parameters $\theta$. If the QMLM structure is fixed and only the continuous parameters are optimized, we write only $\theta$ (instead of $\alpha$). Note that we do not make any further assumptions on how the QMLM $\mathcal{E}^{\mathrm{QMLM}}_\alpha(\cdot)$ depends on the parameters $\alpha = (\theta, k)$ other than that the discrete parameters only encode different choices of quantum circuit architectures. In particular, the dependence of the trainable gates on the continuous parameters $\theta$ can be arbitrary.
We use an observable to quantify how good or bad the output state is; its expectation value serves as our loss function. More concretely, for an input $x_i$ and (classical or quantum) target output $y_i$, we define the loss function of the parameter setting $\alpha$ to be
$$\ell(\alpha; x_i, y_i) := \mathrm{Tr}\left[O^{\mathrm{loss}}_{x_i, y_i}\, (\mathcal{E}^{\mathrm{QMLM}}_\alpha \otimes \mathrm{id})(\rho(x_i))\right] \tag{24}$$
for some Hermitian observables $O^{\mathrm{loss}}_{x_i, y_i}$. Here, $x \mapsto \rho(x)$ is some encoding of the classical data into quantum states that is fixed in advance.
As is common in classical learning theory, the prediction error bounds will depend on the largest (absolute) value that the loss function can attain. In our case, we therefore assume $C_{\mathrm{loss}} := \sup_{x,y} \|O^{\mathrm{loss}}_{x,y}\| < \infty$. That is, we assume that the spectral norm can be bounded uniformly over all possible loss observables.
For a training dataset $S = \{(x_i, y_i)\}_{i=1}^N$, the average loss on the training data is given by
$$\hat{R}_S(\alpha) := \frac{1}{N} \sum_{i=1}^N \ell(\alpha; x_i, y_i),$$
which is often referred to as the training error or empirical risk. This quantity can (in principle) be evaluated, given the parameter setting $\alpha$ and the training data.
When we obtain a new input $x$, the prediction error of a parameter setting $\alpha$ is taken to be the expected loss
$$R(\alpha) := \mathbb{E}_{(x,y) \sim P}\left[\ell(\alpha; x, y)\right],$$
where the expectation is w.r.t. the distribution $P$ from which the training examples are generated. This quantity is called the prediction error or expected risk. The goal of any (classical or quantum) machine learning procedure is to achieve a small prediction error with high success probability.
As the underlying distribution P is usually unknown, we cannot directly evaluate the prediction error, even if we know the parameters α. In practice, one therefore often takes the training error as a proxy for the prediction error. However, this procedure can only succeed if the difference between the prediction and the training error, the so-called generalization error, is small. Our covering number bounds allow us to prove rigorous bounds on the generalization error incurred by a variational quantum machine learning method in the so-called "Probably Approximately Correct" (PAC) sense. That is, we provide bounds on the generalization error in terms of the desired success probability and the training data size. Thereby, our results provide guarantees on the prediction performance of a quantum machine learning model on unseen data, if that model performs well on the training data.
Our main result is the following:
Theorem 4 (Mother Theorem). Let $\mathcal{E}^{\mathrm{QMLM}}_\alpha(\cdot)$ be a QMLM with a variable structure. Suppose that, for every $\tau \in \mathbb{N}$, there are at most $G_\tau \in \mathbb{N}$ allowed structures with exactly $\tau$ parameterized 2-qubit CPTP maps, in which the $t$th of these maps is taken from a set $\mathcal{M}_t$ and used $M_t$ times, and an arbitrary number of non-trainable, global CPTP maps. Also, for each $t \in \mathbb{N}$, let $\mathcal{E}^0_t \in \mathrm{CPTP}((\mathbb{C}^2)^{\otimes 2})$ be a fixed reference CPTP map. Let $P$ be a probability distribution over input-output pairs. Suppose that, given training data $S = \{(x_i, y_i)\}_{i=1}^N$ of size $N$, our optimization of the QMLM over structures and parameters w.r.t. the loss function $\ell(\alpha;\,\cdot\,)$ yields a (data-dependent) structure with $T = T(N)$ independently parameterized 2-qubit CPTP maps, in which the $t$th of these maps is taken from $\mathcal{M}_t$ and used $M_t$ times, as well as the parameter setting $\alpha^* = \alpha^*(S)$.
Then, with probability at least 1 − δ over the choice of i.i.d. training data S of size N according to P , where ∆ T 1 , . . . , ∆ T T denote the (data-dependent) distance between the trainable maps in the output QMLM to the respective reference maps E 0 1 , . . . , E 0 T , C loss = sup x,y ||O loss x,y || is the maximum (absolute) value attainable by the loss function, and the minimum is over all K ∈ {0, . . . , T } and choices of pairwise distinct t 1 , . . . , t K ∈ {1, . . . , T }.
Moreover, if the loss is not evaluated exactly, but an unbiased estimator is built from $\sigma_{\mathrm{est}}$ subsampled training data points (as in Supplementary Note 3.2.6), we only incur an additional error of $O\big(\sqrt{\log(1/\delta)/\sigma_{\mathrm{est}}}\big)$.
Some of the important aspects of the upper bound on the generalization error of a QMLM provided by Theorem 4 are: a dependence on the square root of the inverse of the training data size ($N$); an at worst slightly superlinear dependence on the number of trainable maps ($T$), which can improve if only a smaller number ($K$) of gates experience non-negligible changes during the optimization; a logarithmic dependence on the number of uses ($M_t$); a logarithmic dependence on the number of different architectures ($G_T$); and a logarithmic dependence on the reciprocal of the desired confidence level ($\delta$).
We build up to the proof of Theorem 4 by first establishing our basic QML generalization error bound and then extending it in different directions. More precisely, we structure our presentation as follows: We start with the pedagogical Supplementary Note 3.2.1, in which we show a simple proof of how metric entropy bounds lead to generalization bounds, albeit not yet to their strongest form. In Supplementary Note 3.2.2, we demonstrate how to improve upon the simple proof strategy using a more involved technique. Then, we extend the basic generalization error bounds in multiple directions, namely to multiple copies and reused trainable maps (Supplementary Note 3.
Supplementary Figure 1. Visualization of the proof structure. We prove metric entropy bounds and use them to derive generalization bounds for different QMLM settings. We then apply our theory to quantum phase recognition and unitary compiling.

Remark 2.
A loss function of the form of Eq. (24) automatically has a certain linear structure, namely, it depends linearly on the output state $(\mathcal{E}^{\mathrm{QMLM}}_\alpha \otimes \mathrm{id})(\rho(x_i))$. Notice, however, that we can also introduce a certain type of nonlinearity through the spectral decomposition of the loss observables $O^{\mathrm{loss}}_{x_i, y_i}$. Namely, suppose that we obtain a classical output from the QMLM by measuring an observable $O^{\mathrm{out}}$ with spectral decomposition $O^{\mathrm{out}} = \sum_j \lambda_j |j\rangle\langle j|$. That is, given an input $x_i$, we output $\lambda_j$ with probability $p_j(\alpha, x_i) = \langle j|\mathcal{E}^{\mathrm{QMLM}}_\alpha(\rho(x_i))|j\rangle$. Now, we can, for example, define the loss observables as $O^{\mathrm{loss}}_{x_i, y_i} := \sum_j (y_i - \lambda_j)^2 |j\rangle\langle j|$, so that $\ell(\alpha; x_i, y_i) = \sum_j p_j(\alpha, x_i)(y_i - \lambda_j)^2$ becomes the expected square loss between the true label $y_i$ and our output $\lambda_j$. Here, the expectation is w.r.t. $(p_j(\alpha, x_i))_j$. Clearly, here we can replace $(y_i - \lambda_j)^2$ by any nonlinear loss function $\tilde{\ell}(y_i, \lambda_j)$ of interest.
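The construction in Remark 2 can be sketched in a few lines. The example below assumes that the measurement probabilities $p_j$ are already given (they would come from the output state of the QMLM) and evaluates the expectation of the diagonal loss observable with entries $(y - \lambda_j)^2$; the function names and the numerical values are hypothetical.

```python
def loss_observable_diagonal(eigenvalues, y):
    """Diagonal entries (y - lambda_j)^2 of the loss observable, written in
    the eigenbasis {|j>} of the measured output observable."""
    return [(y - lam) ** 2 for lam in eigenvalues]

def expected_square_loss(probabilities, eigenvalues, y):
    """Expectation of the loss observable: sum_j p_j * (y - lambda_j)^2,
    where p_j is the probability of obtaining outcome lambda_j."""
    diagonal = loss_observable_diagonal(eigenvalues, y)
    return sum(p * d for p, d in zip(probabilities, diagonal))

# Illustrative values: a +/-1-valued output observable, true label y = 1,
# and an output state that yields both outcomes with equal probability.
eigenvalues = [1.0, -1.0]
probabilities = [0.5, 0.5]
print(expected_square_loss(probabilities, eigenvalues, y=1.0))  # prints 2.0
```

The loss stays linear in the probabilities $p_j$, and hence in the output state, even though it is quadratic in the predicted value $\lambda_j$; replacing `(y - lam) ** 2` by another function of `y` and `lam` gives other nonlinear losses in the same way.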

Prelude: Metric entropy bounds imply generalization error bounds
This section is intended to help readers not yet familiar with the theory of classical machine learning gain an intuition for how we derive our analytical results. We present a technically simple proof of a generalization bound for a fixed-architecture QMLM, which, however, is worse than that of Theorem 6 by a factor logarithmic in the training data size. Therefore, readers already well versed in statistical learning theory, or readers who want to focus on the results and not the proofs, can safely skip this pedagogical section.
We demonstrate how the metric entropy bound from Theorem 2 gives rise to a generalization bound for QMLMs with a fixed architecture, in which each trainable 2-qubit CPTP map is used only once. The simplified proof given in this section consists in combining Hoeffding's concentration inequality (Lemma 1) with a union bound over a suitable covering net. Informally, we show that it suffices to prove good generalization simultaneously for all elements in a covering net, which we can obtain from a union bound over standard concentration guarantees for each single element of the covering net.
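Theorem 5 (Prediction error bound for quantum machine learning - Fixed structure (Preliminary version)). Let $\mathcal{E}^{\mathrm{QMLM}}_\theta(\cdot)$ be a QMLM with a fixed architecture consisting of $T$ parameterized 2-qubit CPTP maps and an arbitrary number of non-trainable, global CPTP maps. Let $P$ be a probability distribution over input-output pairs. Suppose that, given training data $S = \{(x_i, y_i)\}_{i=1}^N$ of size $N$, our optimization yields the parameter setting $\theta^* = \theta^*(S)$. Then, with probability at least $1 - \delta$ over the choice of i.i.d. training data $S$ of size $N$ according to $P$,
$$R(\theta^*) - \hat{R}_S(\theta^*) \in O\left(C_{\mathrm{loss}}\left(\sqrt{\frac{T \log(N)}{N}} + \sqrt{\frac{\log(1/\delta)}{N}}\right)\right).$$
Proof. For any parameter setting $\theta$, fixed independently of the choice of training data, we see that $\ell(\theta; x_1, y_1), \ldots, \ell(\theta; x_N, y_N)$ are independent random variables taking values in $[-C_{\mathrm{loss}}, C_{\mathrm{loss}}]$. So Hoeffding's inequality (Lemma 1) tells us that, for all $\eta > 0$, we have
$$\mathbb{P}_S\left[R(\theta) - \hat{R}_S(\theta) \ge \eta\right] \le \exp\left(-\frac{N\eta^2}{2C_{\mathrm{loss}}^2}\right). \tag{29}$$
Here, $\mathbb{P}_S[\cdot] = \mathbb{P}_{S \sim P^N}[\cdot]$ denotes the probability over training data sets from the probability measure $P$. Next, we let $\varepsilon = T/N > 0$, take $N_\varepsilon$ to be an $\varepsilon$-covering net of the set of CPTP maps that can be implemented by the QMLM, and we take a union bound over $N_\varepsilon$, with which we obtain
$$\mathbb{P}_S\left[\exists\, \tilde{\mathcal{E}} \in N_\varepsilon : R(\tilde{\mathcal{E}}) - \hat{R}_S(\tilde{\mathcal{E}}) \ge \eta\right] \le |N_\varepsilon| \exp\left(-\frac{N\eta^2}{2C_{\mathrm{loss}}^2}\right),$$
where, with slight abuse of notation, $R(\tilde{\mathcal{E}})$ and $\hat{R}_S(\tilde{\mathcal{E}})$ denote the prediction and training error when the QMLM implements $\tilde{\mathcal{E}}$. As we took $N_\varepsilon$ to be an $\varepsilon$-covering net (w.r.t. the diamond norm) of the class of CPTP maps that the QMLM can implement, and since $\|\mathcal{E} - \tilde{\mathcal{E}}\|_\diamond \le \varepsilon$ directly implies, for all $x \in \mathcal{X}$, $y \in \mathcal{Y}$,
$$\left|\mathrm{Tr}\left[O^{\mathrm{loss}}_{x,y}\, (\mathcal{E} \otimes \mathrm{id})(\rho(x))\right] - \mathrm{Tr}\left[O^{\mathrm{loss}}_{x,y}\, (\tilde{\mathcal{E}} \otimes \mathrm{id})(\rho(x))\right]\right| \le C_{\mathrm{loss}}\, \varepsilon,$$
we conclude, because of the form of the loss function $\ell$, that
$$\mathbb{P}_S\left[\exists\, \theta : R(\theta) - \hat{R}_S(\theta) \ge \eta + 2C_{\mathrm{loss}}\, \varepsilon\right] \le |N_\varepsilon| \exp\left(-\frac{N\eta^2}{2C_{\mathrm{loss}}^2}\right).$$
Thus, for any $\delta \in (0, 1)$, by choosing $\eta = C_{\mathrm{loss}} \sqrt{\frac{2\log(|N_\varepsilon|/\delta)}{N}}$, we can guarantee that, with probability at least $1 - \delta$ over the choice of training data $S$ of size $N$, we have
$$R(\theta^*) - \hat{R}_S(\theta^*) \le 2C_{\mathrm{loss}}\, \varepsilon + C_{\mathrm{loss}} \sqrt{\frac{2\log(|N_\varepsilon|/\delta)}{N}}.$$
Now, we recall that, by Theorem 2, we can take $N_\varepsilon$ to satisfy $\log(|N_\varepsilon|) \le 512\, T \log(6T/\varepsilon)$. Plugging this into the previous bound, we see that, with probability at least $1 - \delta$ over the choice of training data of size $N$, we have
$$R(\theta^*) - \hat{R}_S(\theta^*) \le \frac{2C_{\mathrm{loss}}\, T}{N} + C_{\mathrm{loss}} \sqrt{\frac{2\left(512\, T \log(6N) + \log(1/\delta)\right)}{N}},$$
which is the claimed generalization error bound.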
Remark 3. At first glance, it might seem that simply plugging the parameter setting θ * into Eq. (29) would already give us a good concentration bound for the parameter setting θ * obtained through training and that the union bound over the covering net is not actually necessary in the above proof. However, as the parameter setting θ * = θ * (S) depends on the whole training data set S, the random variables (θ * ; x i , y i ), i = 1, . . . , N , are not statistically independent. Thus, Hoeffding's inequality alone cannot be used to obtain a version of Eq. (29) with θ replaced by the data-dependent θ * .
The generalization bound established in Theorem 5 already shows the right behavior in terms of the dependence on T , the number of trainable maps. However, the dependence on N , the sample size, still contains an undesirable logarithmic term. In classical statistical learning theory, it is well known that a proof strategy as above, based on combining Hoeffding's concentration inequality with a union bound over a covering net, incurs such a log(N )-term. Fortunately, a technique for removing this term is also known and we will use it to tighten the prediction error bound in the next subsection.

Basic prediction error bound for fixed architecture
Our first prediction error bound is for the case of a variational QMLM with a fixed architecture. In particular, while the parameters in the trainable 2-qubit CPTP maps can be optimized over, the structure of the QMLM, i.e., the arrangement of the different elements, and in particular the overall depth and size, remains fixed. (We provide a generalization to variable circuit architectures in Supplementary Note 3.2.4.) In this scenario, we have the following generalization error bound:
Theorem 6 (Prediction error bound for quantum machine learning - Fixed structure). Let $\mathcal{E}^{\mathrm{QMLM}}_{\theta}(\cdot)$ be a QMLM with a fixed architecture consisting of $T$ parameterized 2-qubit CPTP maps and an arbitrary number of non-trainable, global CPTP maps. Let $P$ be a probability distribution over input-output pairs. Suppose that, given training data $S = \{(x_i, y_i)\}_{i=1}^{N}$ of size $N$, our optimization yields the parameter setting $\theta^* = \theta^*(S)$. Then, with probability at least $1 - \delta$ over the choice of i.i.d. training data $S$ of size $N$ according to $P$,
In case the training data contains quantum labels, we assume the training data states to be reproducible so that we can use the data both for the optimization procedure and for evaluating the training error.
Proof. The proof proceeds in two steps: The first step is to upper-bound the generalization error in terms of the expected supremum of a random process. (This well known technique is described, e.g., in Theorem 3.3 in [54].) In the second step, we invoke the chaining technique to further upper-bound this expected supremum in terms of covering numbers. (This method goes back to [5]. See, e.g., Section 8 of [53] for a pedagogical presentation.) At this point, we apply our covering numbers bounds to finish the proof.
For the first step, let $\phi(S)$ denote the supremum, over all parameter settings $\theta$, of the difference between the prediction error and the training error on the training data set $S$. If two training data sets $S$ and $\tilde{S}$ differ only in a single labelled example, then $|\phi(S) - \phi(\tilde{S})| \leq 2 C_{\mathrm{loss}}/N$ (because the loss function has values in $[-C_{\mathrm{loss}}, C_{\mathrm{loss}}]$). Therefore, we can apply McDiarmid's inequality (Lemma 2) and obtain that, for every $\varepsilon > 0$,
Hence, for every $\delta \in (0, 1)$, with probability $1 - \delta/2$ over the choice of training data, we have, with $\theta^* = \theta^*(S)$ as in the statement of the Theorem,
We now upper-bound $\mathbb{E}_S[\phi(S)]$. To this end, we introduce a so-called ghost sample. Namely, we take $\tilde{S} = \{(\tilde{x}_i, \tilde{y}_i)\}_{i=1}^{N}$ to be an i.i.d. copy of $S$. Then, we can bound
Now, we use a standard symmetrization argument with i.i.d. Rademacher random variables to further upper-bound the right hand side. That is, we let $\sigma_1, \ldots, \sigma_N$ be i.i.d. Rademacher random variables, each distributed uniformly on $\{-1, 1\}$. As multiplying $\big(\ell(\theta; x_i, y_i) - \ell(\theta; \tilde{x}_i, \tilde{y}_i)\big)$ by $-1$ is equivalent to interchanging the i.i.d. copies $(\tilde{x}_i, \tilde{y}_i)$ and $(x_i, y_i)$, which leaves the expectation invariant, we can introduce an additional expectation value over Rademacher variables as follows:
The quantity on the right hand side is not an empirical quantity, i.e., it cannot be directly evaluated only from the training data without knowledge of the underlying distribution $P$. However, another application of McDiarmid's inequality shows that, for every $\varepsilon > 0$,
where we again used that the loss function has values in $[-C_{\mathrm{loss}}, C_{\mathrm{loss}}]$. In other words, for every $\delta \in (0, 1)$, with probability $1 - \delta/2$ over the choice of training data, we have
Applying a union bound, we can combine Eqs. (39) and (45) to conclude: For every $\delta \in (0, 1)$, with probability $1 - \delta$ over the choice of training data of size $N$, we have
This concludes the first step of the proof.
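The Rademacher average that appears at the end of this first step can be made concrete with a small Monte Carlo sketch. It replaces the supremum over all parameter settings by a maximum over a finite list of hypotheses, which is an illustrative simplification and not the setting of the proof; the function name `empirical_rademacher` and the `loss_table` layout are hypothetical.

```python
import random


def empirical_rademacher(loss_table, num_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma[ max_k (1/N) sum_i sigma_i * loss_table[k][i] ].

    loss_table[k][i] is the loss of hypothesis k on training example i.
    The finite maximum stands in for the supremum over parameter settings.
    """
    rng = random.Random(seed)
    n = len(loss_table[0])
    total = 0.0
    for _ in range(num_draws):
        # Draw i.i.d. Rademacher signs, uniform on {-1, +1}.
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        # Largest signed average over the finite hypothesis list.
        total += max(
            sum(s * loss for s, loss in zip(sigma, row)) / n for row in loss_table
        )
    return total / num_draws
```

For a single hypothesis the estimate fluctuates around zero, and adding hypotheses can only increase it, which is the sense in which this quantity measures the richness of the model class.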
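As a second step, we use chaining to upper-bound $\mathbb{E}_\sigma\big[\sup_\theta \frac{1}{N} \sum_{i=1}^{N} \sigma_i\, \ell(\theta; x_i, y_i)\big]$ in terms of covering numbers. For $j \in \mathbb{N}_0$, define $\alpha_j := 2^{-j} C_{\mathrm{loss}}$. By Theorem 2, for every $j \in \mathbb{N}_0$, there exists a $2^{-j}$-covering net $\mathcal{N}_j$ (w.r.t. the diamond norm) of the set of CPTP maps that can be implemented by the QMLM, satisfying $|\mathcal{N}_j| = (6T/2^{-j})^{512T} = (6T \cdot 2^{j})^{512T}$. In particular, for every $j \in \mathbb{N}$ and for every parameter setting $\theta$, there exists a CPTP map $\mathcal{E}_{\theta,j} \in \mathcal{N}_j$ with $\|\mathcal{E}^{\mathrm{QMLM}}_{\theta} - \mathcal{E}_{\theta,j}\|_\diamond \leq 2^{-j}$. For $j = 0$, we can take the 1-covering net $\mathcal{N}_0 = \{0\}$. With this observation at hand, we can bound, for any $m \in \mathbb{N}$,
where we used the telescopic sum representation $\mathcal{E}^{\mathrm{QMLM}}_{\theta} = (\mathcal{E}^{\mathrm{QMLM}}_{\theta} - \mathcal{E}_{\theta,m}) + \sum_{j=1}^{m} (\mathcal{E}_{\theta,j} - \mathcal{E}_{\theta,j-1})$, which holds because $\mathcal{E}_{\theta,0} = 0$. We bound the two summands appearing in Eq. (51) separately. For the first term, we can apply Hölder's inequality to obtain
For the second term, we observe that, thanks to Minkowski's inequality, for every parameter setting $\theta$,
Therefore, for each $1 \leq j \leq m$, we can apply Massart's Lemma (Lemma 3) to the set
with radius $3 \alpha_j \sqrt{N}$ and cardinality $|\mathcal{N}_j| \cdot |\mathcal{N}_{j-1}| \leq |\mathcal{N}_j|^2$ to obtain
where we used the bound on the sizes of the covering nets in the last step.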
If we now use $2^{-j} = 2 \int_{2^{-j-1}}^{2^{-j}} \mathrm{d}\alpha$, we can rewrite the upper bound as
where, in the first inequality, we used that $2^{j} \leq 1/\alpha$ holds inside the limits of the integral. Combining Eqs. (57) and (68), we have proved that, for every $m \in \mathbb{N}$,
If we take the limit $m \to \infty$, this becomes
where we used the integral
We can now combine Eq. (46) with (74) and obtain: With probability $1 - \delta$ over the choice of training data of size $N$, we have
which is the claimed prediction error bound.
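To get a feeling for the size of the covering-number integral that chaining produces, the following sketch evaluates it numerically. It assumes the metric entropy $512\,T \log(6T/\alpha)$ quoted from Theorem 2 and an integration range $(0, 1/2]$; both the range and the function name `entropy_integral` are assumptions made for illustration, and the constants in front of the integral are omitted.

```python
import math


def entropy_integral(T, upper_limit=0.5, num_steps=100_000):
    """Midpoint-rule estimate of the integral over (0, upper_limit] of
    sqrt(512 * T * log(6 * T / alpha)) d alpha.

    The integrand diverges only like sqrt(log(1/alpha)) as alpha -> 0,
    which is integrable, so a plain midpoint rule is adequate here.
    """
    h = upper_limit / num_steps
    total = 0.0
    for k in range(num_steps):
        alpha = (k + 0.5) * h  # midpoint of the k-th subinterval
        total += math.sqrt(512 * T * math.log(6 * T / alpha))
    return total * h
```

Because $\sqrt{a + b} \leq \sqrt{a} + \sqrt{b}$, the value lies below $\sqrt{512\,T}\,\big(\tfrac{1}{2}\sqrt{\log(6T)} + \sqrt{\pi}/2\big)$, so the integral grows like $\sqrt{T \log T}$ and contributes no dependence on $N$.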
Remark 4. For simplicity, throughout the proof of Theorem 6 we have treated θ * (S) as a deterministic function of S. However, the proof extends to the case in which the parameter setting θ * (S) is a random variable depending on S. Then, the generalization error bound would hold with high probability over the choice of the training data and over the internal randomness of the optimization procedure. This is the case for all our prediction error bounds and is important because quantum subroutines in QML procedures make them inherently probabilistic.

Remark 5. A conceptual difference between the proofs of Theorem 5 and Theorem 6, which also explains why the latter leads to a tighter bound than the former, is the following: To prove Theorem 5, we used a $T/N$-covering net for the set of CPTP maps that the QMLM can implement. This can be seen as measuring the complexity of the QMLM at a single specific resolution, namely the resolution $\varepsilon = T/N$. In contrast, the proof of Theorem 6 considers a complexity measure for the QMLM obtained by averaging over complexities (here, covering numbers) at multiple different resolutions. Thus, from a high-level view, the chaining-based proof strategy for Theorem 6 improves upon the reasoning behind Theorem 5 by taking multiple resolutions into account.
Theorem 6 can be interpreted as follows: By taking the training data size N to effectively scale linearly in the number of trainable elements T , we can ensure that a small training error also implies a small prediction error (with high probability).
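As a rough illustration of this interpretation, the sketch below computes how much training data a bound of the form $c\sqrt{T \log(T)/N}$ would require for a target generalization gap. The functional form is an assumption consistent with the scaling discussed in this note, and the constant `c` is a hypothetical placeholder, since the constants of Theorem 6 are not reproduced here.

```python
import math


def training_size_for_gap(T, target_gap, c=1.0):
    """Smallest integer N with c * sqrt(T * log(T) / N) <= target_gap.

    T is the number of trainable maps (T >= 2 so that log(T) > 0);
    c is a hypothetical constant standing in for the constants of the bound.
    """
    return math.ceil(c**2 * T * math.log(T) / target_gap**2)
```

For example, `training_size_for_gap(100, 0.1)` returns the data size at which this assumed bound drops to 0.1; doubling `T` roughly doubles the result, up to the logarithmic factor.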
In the following, we describe extensions of Theorem 6 to different scenarios of interest, and then summarize these in a general "mother theorem."

Extension to multiple copies and gate-sharing
In practice, one often employs quantum machine learning models that reuse the same parameterized gates multiple times, such as quantum convolutional neural networks (QCNNs) [25]. In such a scenario, we speak of "gate-sharing". While the number of trainable elements in such models can still be large, only few of them can be trained independently. As a first extension of Theorem 6, we show that the generalization performance of such models is determined by the effective number of independently trainable elements.
Corollary 1. Let $\mathcal{E}^{\mathrm{QMLM}}_{\theta}(\cdot)$ be a QMLM with a fixed architecture consisting of $T$ independently parameterized 2-qubit CPTP maps, in which the $t$th of these maps is used $M_t$ times, and an arbitrary number of non-trainable, global CPTP maps. Let $P$ be a probability distribution over input-output pairs. Suppose that, given training data $S = \{(x_i, y_i)\}_{i=1}^{N}$ of size $N$, our optimization yields the parameter setting $\theta^* = \theta^*(S)$. Then, with probability at least $1 - \delta$ over the choice of i.i.d. training data $S$ of size $N$ according to $P$,
Proof. The proof strategy is the same as for Theorem 6; we only change the covering number bound to be applied. Namely, instead of Theorem 2, we use Theorem 3. More precisely, we recall Eq. (46), which tells us: For every $\delta \in (0, 1)$, with probability $1 - \delta$ over the choice of training data of size $N$, we have
And, with the same chaining technique as detailed in the proof of Theorem 6, we can bound the above expectation over Rademacher random variables as
where we used the notation from Theorem 3 for the set $\mathrm{CPTP}^{\mathrm{QMLM}}$ of $n$-qubit CPTP maps that can be implemented by the QMLM. Now, we use the metric entropy bound proved in Theorem 3 to further upper bound the integral as
1 /2 0 log (N (CPT P A , || · || , α)) dα 512T log(6T )
As $x \mapsto \log(1/x)$ has an integrable singularity at $x = 0$, the integral in this expression is simply a multiplicative constant. Therefore, after plugging the bound of Eq. (81) into Eq. (79), and then plugging the resulting bound on the Rademacher expectation into Eq. (78), we obtain: For every $\delta \in (0, 1)$, with probability $1 - \delta$ over the choice of training data of size $N$, we have the claimed generalization bound.
A naive approach to the scenario of Corollary 1 would be to upper-bound the metric entropy, and thus the prediction error, in terms of the total number of trainable elements in the QMLM. That, however, would lead to a significantly worse dependence of the prediction error bound on $M_t$, the numbers of uses, namely, of the form
Our more careful analysis shows the tighter bound in which the numbers of uses $M_t$ only appear logarithmically, which is crucial for our application of the bound to quantum phase recognition with QCNNs (see Supplementary Note 4). This is possible because, even though there are in principle $\sum_{t=1}^{T} M_t$ trainable elements in the quantum neural network, they are not trained independently. Rather, the parameter setting for the $t$th parameterized 2-qubit CPTP map is reused $M_t$ times. This clearly shows that reusing parameters is, from a generalization perspective, preferable to having more independent parameters.
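The gap between the two approaches can be illustrated numerically. The sketch below assumes, for the special case $M_t = M$ for all $t$ and with all constants set to one, a naive bound of the form $\sqrt{TM \log(TM)/N}$ and a gate-sharing bound of the form $\sqrt{T \log(TM)/N}$. These forms are assumptions chosen to reflect the linear versus logarithmic dependence on $M$ described in this paragraph; they are not the exact expressions of Corollary 1.

```python
import math


def naive_bound(T, M, N):
    """Assumed form when all T * M gate uses are counted as independent elements."""
    total = T * M
    return math.sqrt(total * math.log(total) / N)


def shared_bound(T, M, N):
    """Assumed form with gate-sharing: M enters only inside the logarithm."""
    return math.sqrt(T * math.log(T * M) / N)
```

Under these assumed forms the ratio of the two bounds is exactly $\sqrt{M}$, so a model that reuses each of its gates 16 times would need roughly 16 times less training data than the naive count suggests.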
As a special case of Corollary 1, we obtain a prediction error bound for the scenario in which multiple copies of a QMLM (with the same parameter settings) are run in parallel:
Corollary 2. Let $\mathcal{E}^{\mathrm{QMLM}}_{\theta}(\cdot)$ be a QMLM with a fixed architecture consisting of $T$ independently parameterized 2-qubit CPTP maps and an arbitrary number of non-trainable, global CPTP maps. By using $M$ copies of this model in parallel, we can consider loss functions of the form
where $O^{\mathrm{loss}}_{x,y}$ are observables acting on the $M$-fold tensor product of an $n$-qubit system. Let $P$ be a probability distribution over input-output pairs. Suppose that, given training data $S = \{(x_i, y_i)\}_{i=1}^{N}$ of size $N$, our optimization yields the parameter setting $\theta^* = \theta^*(S)$. Then, with probability at least $1 - \delta$ over the choice of i.i.d. training data $S$ of size $N$ according to $P$,
By an elementary inequality that holds for all $a, b \geq 0$, the upper bound in Corollary 2 can be rewritten as
If $\delta$ is taken to be a fixed desired confidence level and $C_{\mathrm{loss}}$ is also considered to be a fixed constant, this becomes the bound stated in Theorem 2 in the main text. Corollary 2 tells us that, even when using many copies of the QMLM, as the expressiveness of the corresponding function class grows at most logarithmically with the number of copies, we can still obtain a good prediction error. Note that, as in Corollary 1, it is crucial that the same parameter setting is used for each copy.
We can also phrase the result as follows: We can upper-bound the prediction error incurred when using multiple copies of a quantum machine learning model for evaluating the loss by an expression that depends crucially on the number of trainable elements per copy, but only mildly on the number of copies.
Remark 6. Cases of interest that Corollary 2 describes are, e.g., the loss functions obtained by first performing (independent) product measurements on each of the M copies, then taking an average (for a continuous target space) or a majority vote (for a discrete target space) of the obtained measurement outcomes, and finally post-processing this value by a classical loss function (such as the squared error loss). Such procedures arise naturally when taking into account that multiple shots are needed to accurately estimate the expectation value of an observable measured on the QMLM output. Note, however, that we cannot apply arbitrary procedures for post-processing single-copy measurement outcomes and still hope for a good prediction error. If C loss , which here is the supremum over the spectral norms of the M -copy observables O loss x,y , scales badly (e.g., linearly) with M , the prediction error bound does so as well.

Extension to variable circuit architecture
For practical purposes, it might not be advantageous to fix the number of trainable elements in the QMLM, or even its structure more generally, in advance. Rather, one might also want to optimize over a discrete set of possible architectures, e.g., by growing or truncating the QMLM during the training phase. Therefore, in our second extension of Theorem 6, we provide a prediction error bound for such a variable structure scenario.
Corollary 3. Let $\mathcal{E}^{\mathrm{QMLM}}_{\alpha}(\cdot)$ be a QMLM with a variable structure. Suppose that, for every $\tau \in \mathbb{N}$, there are at most $G_\tau \in \mathbb{N}$ allowed structures with exactly $\tau$ parameterized 2-qubit CPTP maps and an arbitrary number of non-trainable, global CPTP maps. Let $P$ be a probability distribution over input-output pairs. Suppose that, given training data $S = \{(x_i, y_i)\}_{i=1}^{N}$ of size $N$, our optimization yields a (data-dependent) structure with $T = T(S)$ parameterized 2-qubit CPTP maps and the parameter setting $\alpha^* = \alpha^*(S)$. Then, with probability at least $1 - \delta$ over the choice of i.i.d. training data $S$ of size $N$ according to $P$,
Proof. By Theorem 6, for every $\tau \in \mathbb{N}$, for every one of the $G_\tau$ allowed structures with exactly $\tau$ parameterized 2-qubit CPTP maps, with probability $1 - \delta/(2 G_\tau \tau^2)$ over the choice of i.i.d. training data $S$ of size $N$ according to $P$, if $\theta^*_\tau$ is a (continuous) parameter setting (for the $\tau$ parameterized maps) obtained through optimization upon input of training data $S$, we have the generalization error bound
Thus, first taking a union bound over the $G_\tau$ structures with exactly $\tau$ parameterized 2-qubit CPTP maps, and then a union bound over $\tau \in \mathbb{N}$, we see: With probability $1 - \sum_{\tau} \delta/(2\tau^2) \geq 1 - \delta$ over the choice of i.i.d. training data $S$ of size $N$ according to $P$, if the optimization upon input of data $S$ outputs a QMLM architecture with $T = T(N)$ parameterized 2-qubit CPTP maps and the (continuous and discrete) parameter setting $\alpha^* = \alpha^*(S)$,
as claimed.
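The failure probabilities in the two nested union bounds add up as in the following worked computation, which only uses the per-structure failure probability $\delta/(2 G_\tau \tau^2)$ stated in the proof and the value of the Basel series.

```latex
\begin{align*}
% G_\tau structures for each \tau, each failing with probability \delta / (2 G_\tau \tau^2)
\sum_{\tau = 1}^{\infty} G_\tau \cdot \frac{\delta}{2 G_\tau \tau^2}
  = \frac{\delta}{2} \sum_{\tau = 1}^{\infty} \frac{1}{\tau^2}
  = \frac{\delta}{2} \cdot \frac{\pi^2}{6}
  = \frac{\pi^2}{12}\, \delta
  \leq \delta .
\end{align*}
```

Since $\pi^2/12 \approx 0.82$, the total failure probability stays below $\delta$, and the price of the union bound is only the additional $\log(G_\tau \tau^2)$ contribution inside the confidence term.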
We can understand Corollary 3 as saying that the prediction error of a QMLM with a variable structure depends strongly (namely linearly) on T , the number of trainable elements that is used in the output structure of the optimization procedure, but only mildly (namely logarithmically) on G T , the number of different possible structures with the same number of gates as the output structure. Note that the bound does not depend on all structures potentially considered during the optimization, but only on a subset of those. In particular, if the number T of trainable 2-qubit maps is fixed in advance, optimizing not only over the parameter settings of the model but also over exponentially-in-T many structures with T trainable elements does not worsen the asymptotic behavior of the generalization error.

Extension taking the optimization into account
In our previous results, we have provided bounds on the generalization error that depended on the QMLM, e.g., via the number of trainable elements or the number of copies, or even on how many different architectures are admissible. So far, however, the bounds are agnostic w.r.t. the procedure used to train the QMLM. In this section, we refine our approach to prove optimization-dependent generalization bounds, that explicitly take properties of the training process into account.
Take M 1 , . . . , M T to be sets of 2-qubit CPTP maps. Each set M t denotes the space of 2-qubit CPTP maps that one permits for the t th trainable map during the training of the QMLM E QMLM θ . Hence, each M t should be seen as the trainable space for a particular gate in the QMLM. For example, M t could be the space of all 2-qubit unitary channels, or the space of all tensor products of single-qubit CPTP maps.
As discussed in the proofs of Lemmas 6 and 7, as $\mathrm{CPTP}(\mathbb{C}^2 \otimes \mathbb{C}^2)$ is compact, for every $1 \leq t \leq T$, there exists a constant $c_t \geq 1$, depending, e.g., on the diameter and on the effective ambient dimension of $\mathcal{M}_t$, such that $\log \mathcal{N}(\mathcal{M}_t, \|\cdot\|_\diamond, \varepsilon) \leq c_t \log(1 + 1/\varepsilon)$.
Note that, as a worst-case estimate, we have $c_t \leq 1024$. This can be seen by arguing as in the proofs of Lemmas 6 and 7, with ambient dimension 512 and diameter 2, and then applying Bernoulli's inequality.
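The worst-case value can be traced through the following short computation. The first inequality is an assumption about the form of the volumetric covering bound behind Lemmas 6 and 7 (ambient dimension 512, diameter 2); the second step is Bernoulli's inequality $(1 + x)^2 \geq 1 + 2x$ applied with $x = 1/\varepsilon$.

```latex
\begin{align*}
\log \mathcal{N}(\mathcal{M}_t, \|\cdot\|_\diamond, \varepsilon)
  &\leq 512 \log\!\Big( 1 + \frac{2}{\varepsilon} \Big)
      && \text{(assumed volumetric bound)} \\
  &\leq 512 \log\!\Big( \Big( 1 + \frac{1}{\varepsilon} \Big)^{2} \Big)
      && \text{(Bernoulli's inequality)} \\
  &= 1024 \log\!\Big( 1 + \frac{1}{\varepsilon} \Big) .
\end{align*}
```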
Given a fixed choice of parameters $\theta$, and thereby a fixed choice $\mathcal{E}_1 \in \mathcal{M}_1, \ldots, \mathcal{E}_T \in \mathcal{M}_T$ of the trainable 2-qubit CPTP maps, the (fixed-architecture) QMLM implements the $n$-qubit CPTP map
where the $\mathcal{F}_t$, for $0 \leq t \leq T$, are fixed (potentially global) CPTP maps. Suppose that we begin the optimization of the QMLM from an initial point independent of the training data $S = \{(x_i, y_i)\}_{i=1}^{N}$, described by a parameter setting $\theta_0$. We denote the choices for the $T$ trainable maps appearing in the initial CPTP map by $\mathcal{E}^0_1 \in \mathcal{M}_1, \ldots, \mathcal{E}^0_T \in \mathcal{M}_T$. After utilizing the training data for multiple rounds of optimization, the training of the QMLM finishes at a (data-dependent) point in $\mathrm{CPTP}((\mathbb{C}^2)^{\otimes n})$, described by a parameter vector $\theta^*$, which we denote by the choice $\mathcal{E}^*_1 \in \mathcal{M}_1, \ldots, \mathcal{E}^*_T \in \mathcal{M}_T$ of trainable maps. Note that $\mathcal{E}^*_1, \ldots, \mathcal{E}^*_T$ depend on the training data $S$. For each of the $T$ trainable local CPTP maps, indexed by $t$, we denote the distance (measured w.r.t. $\|\cdot\|_\diamond$) between the initial and the final point of the training procedure by $\Delta_t := \|\mathcal{E}^*_t - \mathcal{E}^0_t\|_\diamond$. In the following Theorem, we provide a generalization guarantee for the resulting QMLM defined by the choice of trainable local maps $\mathcal{E}^*_1, \ldots, \mathcal{E}^*_T$ in terms of the optimization distances $\Delta_t$, the number $T$ of trainable maps, and the number $N$ of training data points.
Theorem 7 (Optimization-dependent prediction error bound for quantum machine learning). Let $\mathcal{E}^{\mathrm{QMLM}}_{\theta}(\cdot)$ be a QMLM with a fixed architecture consisting of $T$ parameterized 2-qubit CPTP maps, in which the $t$th of these maps is taken from $\mathcal{M}_t$, and an arbitrary number of non-trainable, global CPTP maps. Let $P$ be a probability distribution over input-output pairs. Suppose that, given training data $S = \{(x_i, y_i)\}_{i=1}^{N}$ of size $N$, the optimization procedure yields the parameter setting $\theta^* = \theta^*(S)$. As described above, denote by $\Delta_t = \Delta_t(S)$ the optimization distance (measured in diamond norm) of the $t$th trainable map. Then, with probability $1 - \delta$ over the choice of i.i.d. training data $S$ of size $N \geq 4$ according to $P$,
where the minimum is over all $K \in \{0, \ldots, T\}$ and choices of pairwise distinct $t_1, \ldots, t_K \in \{1, \ldots, T\}$.
The proof of Theorem 7 again hinges on a metric entropy bound, this time for the class of CPTP maps that can be reached by the QMLM under the optimization procedure. Hence, let us first prove the following theorem.
Theorem 8. Let $\mathcal{E}^{\mathrm{QMLM}}_{\theta}(\cdot)$ be a QMLM with a fixed architecture consisting of $T$ parameterized 2-qubit CPTP maps, in which the $t$th of these maps is taken from $\mathcal{M}_t$, and an arbitrary number of non-trainable, global CPTP maps. Let $0 \leq \Delta_1, \ldots, \Delta_T \leq 2$ be a sequence of distances for the trainable 2-qubit CPTP maps. Let $\mathrm{CPTP}^{\mathrm{QMLM}}_{(\Delta_t)_t} \subset \mathrm{CPTP}((\mathbb{C}^2)^{\otimes n})$ denote the set of $n$-qubit CPTP maps that can be implemented by the QMLM, under the additional restriction that the $t$th trainable gate is at most diamond-distance $\Delta_t$ away from the fixed initial point $\mathcal{E}^0_t$. Let $K \in \{0, \ldots, T\}$. Let $t_1, \ldots, t_K \in \{1, \ldots, T\}$ be pairwise distinct. Then, for any $\varepsilon \in (0, 1]$, if we write
there exists an $\varepsilon_K$-covering net $\mathcal{N}_{\varepsilon_K}$ of $\mathrm{CPTP}^{\mathrm{QMLM}}_{(\Delta_t)_t}$ w.r.t. the diamond distance such that the logarithm of its size can be upper bounded as
Proof. By the assumptions on the structure of $\mathrm{CPTP}^{\mathrm{QMLM}}_{(\Delta_t)_t}$, there exist fixed, global CPTP maps $\mathcal{F}_0, \ldots, \mathcal{F}_T \in \mathrm{CPTP}((\mathbb{C}^2)^{\otimes n})$ such that any $\mathcal{E} \in \mathrm{CPTP}^{\mathrm{QMLM}}_{(\Delta_t)_t}$ can be written as
As discussed above, for each $1 \leq t \leq T$, we can take $\mathcal{N}_t$ to be an $(\varepsilon/K)$-covering net for $\mathcal{M}_t$ w.r.t. $\|\cdot\|_\diamond$ whose cardinality satisfies $\log(|\mathcal{N}_t|) \leq c_t \log(1 + K/\varepsilon)$. We define $\mathcal{N}_{\varepsilon_K}$ to be the set of CPTP maps that can be implemented by A if exactly $K$ of the trainable 2-qubit CPTP maps are taken from $\mathcal{N}_{t_1}, \ldots, \mathcal{N}_{t_K}$, respectively, and the remaining $T - K$ trainable maps are left at the initial point of the optimization. That is, we define
Using the subadditivity of the distance induced by the diamond norm (Lemma 4), it is easy to see that, for every
Thus, $\mathcal{N}_{\varepsilon_K}$ is indeed an $\varepsilon_K$-covering net for $\mathrm{CPTP}^{\mathrm{QMLM}}_{(\Delta_t)_t}$, as claimed. It remains to observe that, by definition of $\mathcal{N}_{\varepsilon_K}$, we have
as claimed.
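The two estimates at the heart of this proof can be sketched as follows for $K \geq 1$. The notation $\tilde{\mathcal{E}}_t$ for the trainable maps of the approximating net element, and the resulting form of $\varepsilon_K$, are reconstructions from the construction described in the proof and should be treated as assumptions.

```latex
% Sketch: \tilde{\mathcal{E}}_{t_k} \in \mathcal{N}_{t_k} is a net element closest to \mathcal{E}_{t_k},
% and \tilde{\mathcal{E}}_t = \mathcal{E}^0_t for all t outside \{t_1, \ldots, t_K\}.
\begin{align*}
\big\| \mathcal{E} - \tilde{\mathcal{E}} \big\|_\diamond
  &\leq \sum_{t=1}^{T} \big\| \mathcal{E}_t - \tilde{\mathcal{E}}_t \big\|_\diamond
      && \text{(subadditivity, Lemma 4)} \\
  &\leq \sum_{k=1}^{K} \frac{\varepsilon}{K}
      + \sum_{t \notin \{t_1, \ldots, t_K\}} \Delta_t
   = \varepsilon + \sum_{t \notin \{t_1, \ldots, t_K\}} \Delta_t , \\
\log \big| \mathcal{N}_{\varepsilon_K} \big|
  &\leq \sum_{k=1}^{K} \log \big| \mathcal{N}_{t_k} \big|
   \leq \sum_{k=1}^{K} c_{t_k} \log\!\Big( 1 + \frac{K}{\varepsilon} \Big) .
\end{align*}
```

The first chain suggests that the covering radius is $\varepsilon$ plus the total optimization distance of the maps that are frozen at their initial point, while the second shows that only the $K$ selected maps contribute to the metric entropy.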
Armed with this metric entropy bound, we can now prove Theorem 7.
Proof of Theorem 7. Starting from the metric entropy bound of Theorem 8, we again argue as in the proof of Theorem 6. Recall that the first step of said proof was to establish Eq. (46). This was then followed in a second step by upper-bounding the obtained expression using a covering number integral. The first step, leading to Eq. (46), is also valid in the scenario of this Theorem. That is, we again have that, with probability $1 - \delta$ over the choice of training data of size $N$,
However, we have to change the second step. To this end, we first observe that, by a reasoning completely analogous to the one leading up to Eq. (69), for every $m \in \mathbb{N}_0$,
where we used the notation from Theorem 8. Fix a $K \in \{0, \ldots, T\}$ and pairwise distinct $t_1, \ldots, t_K \in \{1, \ldots, T\}$ such that $\tilde{\Delta} := \sum_{\substack{t=1 \\ t \neq t_1, \ldots, t_K}}^{T} \Delta_t < \tfrac{1}{2}$, and take $m \in \mathbb{N}_0$ such that $\tilde{\Delta} < 2^{-(m+1)} < 2\tilde{\Delta}$. Then in particular $2^{-m} \leq 4\tilde{\Delta}$ and we can further upper bound the expression in Eq. (105) as
At this point, we can apply the metric entropy bound from Theorem 8 to obtain
Altogether, so far we have shown that, for any fixed choice of $K \in \{0, \ldots, T\}$ and of pairwise distinct $t_1, \ldots, t_K \in \{1, \ldots, T\}$ such that $\sum_{\substack{t=1 \\ t \neq t_1, \ldots, t_K}}^{T} \Delta_t < \tfrac{1}{2}$, with probability $1 - \delta$ over the choice of training data of size $N$, we have
After a union bound over the at most $\binom{T}{K} \leq T^K$ different choices of pairwise distinct $t_1, \ldots, t_K \in \{1, \ldots, T\}$ (with $\sum_{\substack{t=1 \\ t \neq t_1, \ldots, t_K}}^{T} \Delta_t < \tfrac{1}{2}$), we see that, with probability $1 - \delta$ over the choice of training data of size $N$, we have
where $K \in \{0, \ldots, T\}$ is still fixed and the minimum is over all choices of pairwise distinct $t_1, \ldots, t_K \in \{1, \ldots, T\}$. Finally, we can take a union bound over at most $T + 1$ different values of $K$ and obtain that, with probability $1 - \delta$ over the choice of training data of size $N$, we have
where the minimum is over all values of $K$ and over all choices of pairwise distinct $t_1, \ldots, t_K \in \{1, \ldots, T\}$.
If, in the generalization error bound of Theorem 7, we disregard the potential improvements gained from the $\mathcal{M}_t$-dependent constants $c_t$ and instead replace all of them by their worst-case value 1024, we can simplify the bound to
because $K \log(K) \leq K \log(T)$ for all $K = 1, \ldots, T$. Moreover, instead of taking a minimum over all such $K$ and over all choices of pairwise distinct $t_1, \ldots, t_K \in \{1, \ldots, T\}$, we can take the minimum only over $K$, and fix the choice $t_k = k$ to obtain
If we again fix a confidence level $\delta$ and consider $C_{\mathrm{loss}}$ as a fixed constant of the problem, this becomes the bound given in Theorem 3 for the case $M = 1$. (The case for general $M$ follows from our "mother theorem," see Supplementary Note 3.2.7.)
We can clearly see that, if the optimization has only made substantial changes to few trainable maps, then the generalization error bound in Theorem 7 is dominated by the maps that have undergone more significant changes during optimization. The number of such parameterized maps could be much smaller than the overall number of the trainable CPTP maps $T$. Therefore, this optimization-dependent generalization error bound can significantly outperform the previous bounds, which did not take the optimization procedure into account, w.r.t. the dependence on $T$.
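The trade-off expressed by the simplified bound can be explored with a short sketch. It assumes the form $\min_K \big\{ c \sqrt{K \log(T)/N} + \sum_{k > K} \Delta_k \big\}$ with the fixed choice $t_k = k$; the constant `c` and the function name are hypothetical, and the confidence term is omitted.

```python
import math


def optimization_dependent_bound(deltas, N, c=1.0):
    """Return (value, K) minimizing c * sqrt(K * log(T) / N) + sum(deltas[K:]).

    deltas[k] is the optimization distance of trainable map k + 1, listed in
    the fixed order t_k = k; c is a hypothetical constant. Requires T >= 1.
    """
    T = len(deltas)
    best_value, best_K = float("inf"), None
    for K in range(T + 1):
        # First K maps are covered by a net; the rest stay at their reference point.
        value = c * math.sqrt(K * math.log(T) / N) + sum(deltas[K:])
        if value < best_value:
            best_value, best_K = value, K
    return best_value, best_K
```

With two maps that moved substantially and 98 that barely moved, e.g. `deltas = [1.0, 0.8] + [1e-4] * 98` and `N = 1000`, the minimum is attained at `K = 2`, and the resulting value is far below the optimization-independent term $\sqrt{T \log(T)/N}$.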
One consequence of this theorem is that a good choice of initialization for the optimization of a QMLM can not only serve to improve the cost of the optimization itself, but it can also help the generalization behavior. Namely, a particularly good choice of initialization, potentially found through pretraining on an independent data set, can lead to an optimization procedure that does not have to deviate too far from the initialization w.r.t. some of the trainable maps, which, according to our bound, will be advantageous for generalization.
A second implication of this result for what to take into account in designing an optimization procedure for training a QMLM is the following: Making large steps only on few trainable gates and only negligibly small steps on the remaining ones is, from a generalization perspective, preferable to making steps of comparable, non-negligible sizes on many (or even all) of the trainable gates.
Remark 7. We note that in the proof of Theorem 7, it was not necessary that the fixed CPTP maps $\mathcal{E}^0_1 \in \mathcal{M}_1, \ldots, \mathcal{E}^0_T \in \mathcal{M}_T$ were given as the initialization of the optimization procedure. In fact, we can take these maps to be any fixed "reference points" w.r.t. which we measure distances. The proof then works without changes, as long as the reference maps are indeed fixed in advance, independently of the training data.

Extension to unbiased estimates of measurement statistics
In practice, we cannot obtain the exact value of $\mathrm{Tr}\big[O^{\mathrm{loss}}_{x_i, y_i} (\mathcal{E}^{\mathrm{QMLM}}_{\theta} \otimes \mathrm{id})(\rho(x_i))\big]$ for a training example $(x_i, y_i)$ if we only perform finitely many measurements. Instead, as a proxy for the training error, we consider an unbiased estimator: For $1 \leq \sigma \leq \sigma_{\mathrm{est}}$, with $\sigma_{\mathrm{est}} \in \mathbb{N}$ fixed, we independently pick $i_\sigma$ uniformly at random from $\{1, \ldots, N\}$ and measure the observable $O^{\mathrm{loss}}_{x_{i_\sigma}, y_{i_\sigma}}$ on the output state $\mathcal{E}^{\mathrm{QMLM}}_{\theta}(\rho(x_{i_\sigma}))$ to yield a single measurement outcome $o^{\mathrm{loss}}_{\theta, \sigma} \in \big[-\|O^{\mathrm{loss}}_{x_{i_\sigma}, y_{i_\sigma}}\|, \|O^{\mathrm{loss}}_{x_{i_\sigma}, y_{i_\sigma}}\|\big]$. The expectation $\mathbb{E}\big[o^{\mathrm{loss}}_{\theta, \sigma}\big]$ equals the training error, where the expectation is taken w.r.t. the sampling of $i_\sigma$ and the randomness in obtaining the measurement outcome. This yields a finite sequence of observations $o^{\mathrm{loss}}_{\theta, 1}, \ldots, o^{\mathrm{loss}}_{\theta, \sigma_{\mathrm{est}}}$, with
In this scenario, where we only obtain a noisy estimate of the training error from $\sigma_{\mathrm{est}}$ measurements, the prediction performance guarantee takes the following form:
Corollary 4. Let $\mathcal{E}^{\mathrm{QMLM}}_{\theta}(\cdot)$ be a quantum machine learning model with a fixed architecture consisting of $T$ parameterized 2-qubit CPTP maps. Let $P$ be a probability distribution over input-output pairs. Suppose that, given training data $S = \{(x_i, y_i)\}_{i=1}^{N}$ of size $N$, our optimization yields the parameter setting $\theta^* = \theta^*(S)$. Then, with probability at least $1 - \delta$ over the choice of i.i.d. training data $S$ of size $N$ according to $P$, over the sampling of $i_1, \ldots, i_{\sigma_{\mathrm{est}}}$, and over the $\sigma_{\mathrm{est}}$ obtained measurement outcomes,
Proof. We first insert a zero in terms of the empirical risk as follows:
By Theorem 6, with probability at least $1 - \delta/2$ over the choice of the training data, the first term on the right-hand side (which is independent of the subsampling and of the obtained measurement outcomes) is bounded as
By Hoeffding's inequality, for any fixed $S$ and $\theta^*$, with probability at least $1 - \delta/2$ over the sampling of $i_1, \ldots, i_{\sigma_{\mathrm{est}}}$ and over the $\sigma_{\mathrm{est}}$ obtained measurement outcomes, the second term on the right-hand side is at most $C_{\mathrm{loss}} \sqrt{2 \log(2/\delta) / \sigma_{\mathrm{est}}}$. Therefore, we also have
Now, the statement of the Corollary follows via a union bound.
This shows that we do not need to perform a disproportionately large number of measurements to guarantee that the estimated training error is indeed a good proxy for the prediction error. It suffices to choose $\sigma_{\mathrm{est}}$ to be roughly $N / (T \log(T))$, along with $N$ being sufficiently larger than $T \log(T)$, to guarantee that the prediction error will not be much higher than the approximate (observed) training error.
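The subsampling estimator and its Hoeffding deviation can be simulated with a toy model. In this sketch every measurement outcome is $\pm 1$ with a mean equal to the expected loss of the sampled training example, which corresponds to $C_{\mathrm{loss}} = 1$; the function names and the two-outcome measurement model are assumptions made for illustration.

```python
import math
import random


def estimate_training_error(expected_losses, sigma_est, rng):
    """Average of sigma_est single-shot outcomes.

    Each shot picks a training index uniformly at random and then draws a
    +/-1 outcome whose mean equals that example's expected loss.
    """
    n = len(expected_losses)
    total = 0
    for _ in range(sigma_est):
        mean = expected_losses[rng.randrange(n)]
        # P(+1) = (1 + mean) / 2 gives an outcome with expectation `mean`.
        total += 1 if rng.random() < (1 + mean) / 2 else -1
    return total / sigma_est


def hoeffding_radius(sigma_est, delta, c_loss=1.0):
    """Deviation C_loss * sqrt(2 * log(2 / delta) / sigma_est) from the proof."""
    return c_loss * math.sqrt(2 * math.log(2 / delta) / sigma_est)
```

Calling `estimate_training_error([-0.5, 0.0, 0.25, 0.75], 20000, random.Random(0))` yields a value close to the exact training error 0.125, and `hoeffding_radius` shows how quickly the admissible deviation shrinks as the number of shots grows.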

Mother theorem
We can summarize all the previously discussed extensions of Theorem 6 in Theorem 4, which we restate here for convenience:
Theorem 4 (Mother Theorem). Let $\mathcal{E}^{\mathrm{QMLM}}_{\alpha}(\cdot)$ be a QMLM with a variable structure. Suppose that, for every $\tau \in \mathbb{N}$, there are at most $G_\tau \in \mathbb{N}$ allowed structures with exactly $\tau$ parameterized 2-qubit CPTP maps, in which the $t$th of these maps is taken from $\mathcal{M}_t$ and used $M_t$ times, and an arbitrary number of non-trainable, global CPTP maps. Also, for each $t \in \mathbb{N}$, let $\mathcal{E}^0_t \in \mathrm{CPTP}((\mathbb{C}^2)^{\otimes 2})$ be a fixed reference CPTP map. Let $P$ be a probability distribution over input-output pairs. Suppose that, given training data $S = \{(x_i, y_i)\}_{i=1}^{N}$ of size $N$, our optimization of the QMLM over structures and parameters w.r.t. the loss function $\ell(\alpha; x_i, y_i) = \mathrm{Tr}\big[O^{\mathrm{loss}}_{x_i, y_i} (\mathcal{E}^{\mathrm{QMLM}}_{\alpha} \otimes \mathrm{id})(\rho(x_i))\big]$ yields a (data-dependent) structure with $T = T(N)$ independently parameterized 2-qubit CPTP maps, in which the $t$th of these maps is taken from $\mathcal{M}_t$ and used $M_t$ times, as well as the parameter setting $\alpha^* = \alpha^*(S)$.
Then, with probability at least $1 - \delta$ over the choice of i.i.d. training data $S$ of size $N$ according to $P$,
where $\Delta^T_1, \ldots, \Delta^T_T$ denote the (data-dependent) distances between the trainable maps in the output QMLM and the respective reference maps $\mathcal{E}^0_1, \ldots, \mathcal{E}^0_T$, $C_{\mathrm{loss}} = \sup_{x,y} \|O^{\mathrm{loss}}_{x,y}\|$ is the maximum (absolute) value attainable by the loss function, and the minimum is over all $K \in \{0, \ldots, T\}$ and choices of pairwise distinct $t_1, \ldots, t_K \in \{1, \ldots, T\}$.
Moreover, if the loss is not evaluated exactly, but an unbiased estimator is built from $\sigma_{\mathrm{est}}$ subsampled training data points (as in Supplementary Note 3.2.6), we only incur an additional error of $O\big(\sqrt{\log(1/\delta)/\sigma_{\mathrm{est}}}\big)$.
Proof. To prove this most general version of our results, we combine the previous results and proof strategies. First, fix $\tau \in \mathbb{N}$ and one of the $G_\tau$ admissible QMLM architectures with exactly $\tau$ trainable 2-qubit CPTP maps, in which the $t$th of these maps is taken from $\mathcal{M}_t$ and used $M_t$ times. With the same strategy as in the proof of Theorem 8, if we take $\mathcal{N}_t$ to be a $(\varepsilon/(K M_t))$-covering net for $\mathcal{M}_t$ w.r.t. $\|\cdot\|_\diamond$, and consider the set of $n$-qubit CPTP maps obtained from the QMLM if exactly $K$ of the $\tau$ independently trainable 2-qubit CPTP maps are taken from the respective $\mathcal{N}_t$, and the remaining $\tau - K$ maps are left at the corresponding reference map, this gives us an $\varepsilon_K$-covering net $\mathcal{N}_\varepsilon$ of the class of $n$-qubit CPTP maps that the QMLM architecture can implement, where
This $\varepsilon_K$-covering net can be taken to have cardinality bounded as
If we use this metric entropy bound for the chaining argument presented in the proof of Theorem 7, we can show that, with probability $1 - \delta/(2 G_\tau \tau^2)$ over the choice of i.i.d. training data $S$ of size $N$, if $\theta^*_\tau$ is a (continuous) parameter setting for the $\tau$ parameterized maps obtained through optimization upon data $S$, we have
where the minimum is over $K \in \{0, \ldots, \tau\}$ and over choices of pairwise distinct $t_1, \ldots, t_K \in \{1, \ldots, \tau\}$.
We now take a union bound first over the $G_\tau$ admissible structures and then over $\tau \in \mathbb{N}$ to obtain: With probability $1 - \delta$ over the choice of i.i.d. training data $S$ of size $N$, if the optimization upon input of $S$ outputs a QMLM architecture with $T = T(N)$ parameterized 2-qubit CPTP maps and the (discrete and continuous) parameter setting $\alpha^* = \alpha^*(S)$, then we have the generalization error bound with the minimum as claimed.
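The two union bounds go through because the per-architecture failure probabilities are summable. As a short check, reading the failure probability allotted to each architecture in the proof as $\delta/(2 G_\tau \tau^2)$ and assuming that $\tau$ ranges over the positive integers, the total failure probability is bounded as follows:

```latex
\[
  \sum_{\tau=1}^{\infty} \sum_{g=1}^{G_\tau} \frac{\delta}{2\, G_\tau\, \tau^2}
  \;=\; \frac{\delta}{2} \sum_{\tau=1}^{\infty} \frac{1}{\tau^2}
  \;=\; \frac{\pi^2}{12}\,\delta
  \;<\; \delta .
\]
```

The factor $1/G_\tau$ absorbs the union bound over structures with a fixed number of maps, and the factor $1/\tau^2$ makes the sum over $\tau$ converge.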
To understand the added error in the case in which an unbiased estimate of the empirical risk is used, we repeat the analysis given in Supplementary Note 3.2.6, but use the generalization bound just established instead of the one from Theorem 6, and obtain the claimed bound.
As for the previous results, we briefly explain how Theorem 4 leads to the result stated as Theorem 5 in the main text. First, Theorem 5 only considers the case $M_t = M$ for all $t$. Second, as in Supplementary Note 3.2.5, we can bound the constants $c_t$ by their worst-case upper bound of 1024, and then take a minimum not over all $K$ and all choices of $t_1, \ldots, t_K$, but only over all $K$ with the fixed choice $t_k = k$. With these two simplifications, the bound becomes Finally, once we plug in a constant confidence level $\delta$ and also consider $C_{\mathrm{loss}}$ as a constant dictated by the problem, we end up with the bound of Theorem 5 from the main text.
Remark 8. In Theorem 4, we have chosen a fixed reference map $\mathcal{E}^0_t \in \mathrm{CPTP}\left((\mathbb{C}^2)^{\otimes 2}\right)$ for every $t \in \mathbb{N}$. One could even choose different reference maps for each $k$ and for each of the $G_k$ allowed structures with exactly $k$ parameterized maps.
In Supplementary Note 3.2.5, we have taken the initial point of the optimization procedure as the reference point for evaluating distances. A similar interpretation is possible in Theorem 4; however, the reference points can be more abstract. In principle, the reference maps can be chosen freely, as long as the choice is independent of the training data sample w.r.t. which the empirical risk is evaluated.
Remark 9. We present our results for the case of a QMLM $\mathcal{E}^{\mathrm{QMLM}}_{\alpha}(\cdot)$ acting on a quantum input state $\rho(x)$. If $x$ describes classical data, this presumes an "encoding-first" architecture, in which the classical-to-quantum data encoding $x \mapsto \rho(x)$ is applied first, followed by a trainable quantum circuit. As observed in [22,55,56], the expressive power of a QMLM for processing classical data can significantly benefit from allowing for data re-uploading [57]. This is achieved by allowing for a more flexible form of QMLM, in which data-encoding and trainable gates can be interleaved. Our results, which focus on the trainable part of the QMLM circuit, directly extend to QMLMs with data re-uploading.
This can be seen as follows: In our proofs of the metric entropy bounds from Subsection 3.1, we already allowed for an interleaving of the trainable gates with arbitrary fixed gates. The same reasoning applies if we replace the fixed gates by encoding gates depending on the classical input data $x$, as long as they are still independent of the trainable parameters.

Supplementary Note 4. Application to quantum phase recognition
As a first application of our prediction error bounds, we demonstrate their implications for quantum phase recognition (QPR) with quantum convolutional neural networks (QCNNs). Here, for each training example $(|\psi_i\rangle, y_i)$, the encoded input is simply $\rho(x_i) = |\psi_i\rangle\langle\psi_i|$, a pure $n$-qubit quantum state that belongs to one of four possible quantum phases of matter. The corresponding output label $y_i \in \{0, 1\}^2$ tells us to which of the four phases $\rho(x_i)$ belongs. The goal of a quantum machine learning model for this scenario is to accurately predict, given a new input $x$, the corresponding label, and thus the phase, of the state $\rho(x)$.
In our language, a QCNN acting on $n$-qubit states, as introduced in [25], is a QMLM $\mathcal{E}^{\mathrm{QCNN}}_{\theta}(\cdot)$ with a particular fixed structure, explained in more detail in Section II.C of the main text, consisting of $\log(n)$ independently parameterized 2-qubit maps, each of which is used at most $n$ times. By measuring some of the qubits and then discarding them in pooling layers, the QCNN maps an $n$-qubit input to a 2-qubit output, on which it then performs a computational basis measurement. The phase prediction that the QCNN makes for an $n$-qubit input state is the one corresponding to the smallest of the four outcome probabilities in the computational basis measurement on the output state. This can be well approximated by running multiple gate-sharing copies of the QCNN in parallel and appropriately post-processing the single measurement outcomes. For simplicity of presentation, we showcase our bounds in the scenario of only one copy of the QCNN. However, this extends to multiple gate-sharing copies according to Corollary 2.
Thus, we consider the loss function characterized by the loss observables $O^{\mathrm{loss}}_{x_i,y_i} = O^{\mathrm{loss}}_{y_i} = |y_i\rangle\langle y_i|$, which are independent of $x_i$. This means the loss function is given by
$$\ell(\theta; |\psi_i\rangle, y_i) = \mathrm{Tr}\left[|y_i\rangle\langle y_i| \, \mathcal{E}^{\mathrm{QCNN}}_{\theta}(|\psi_i\rangle\langle\psi_i|)\right] = \langle y_i| \mathcal{E}^{\mathrm{QCNN}}_{\theta}(|\psi_i\rangle\langle\psi_i|) |y_i\rangle.$$
That is, the QMLM achieves a small value of the loss function on the example $(|\psi_i\rangle, y_i)$ if the probability of observing $y_i$ when performing a computational basis measurement on the output state, upon input $|\psi_i\rangle$, is small. Correspondingly, the true risk is
$$R(\theta) = \mathbb{E}_{(|\psi\rangle, y) \sim P}\left[\ell(\theta; |\psi\rangle, y)\right],$$
and the empirical risk on training data $S = \{(|\psi_i\rangle, y_i)\}_{i=1}^{N}$ is
$$\hat{R}_S(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell(\theta; |\psi_i\rangle, y_i).$$
With the scenario established, we can now apply the prediction error bound proved in Corollary 2. Here, it takes the form: Suppose that, given training data $S$ of size $N$, our optimization of the parameters in the QCNN yields a parameter setting $\theta^* = \theta^*(S)$. Then, with probability $1 - \delta$ over the choice of training data, Therefore, a small training error guarantees a small prediction error already for training data size $N \in O(\mathrm{poly}(\log(n)))$.
In other words, when using a QCNN for QPR, a good generalization error is already guaranteed for training data of size poly-logarithmic in n, the number of qubits. Thereby, our results provide a rigorous explanation for the good generalization behavior of QCNNs even for small training data size that was observed numerically in [25].
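The prediction rule and loss described above can be sketched numerically. The following minimal Python example assumes that the four outcome probabilities of the 2-qubit computational basis measurement are already available as a vector. The label ordering, the function names, and the example probabilities are all hypothetical and chosen only for illustration; no QCNN is simulated here.

```python
import numpy as np

# Labels of the four computational-basis outcomes on the 2-qubit output.
# The ordering is an assumption made for this illustration.
LABELS = ["00", "01", "10", "11"]

def qcnn_loss(probs, label):
    """Loss on one example: the probability of observing `label`."""
    return float(probs[LABELS.index(label)])

def predict_phase(probs):
    """Predicted label: the outcome with the smallest probability."""
    return LABELS[int(np.argmin(probs))]

def empirical_risk(prob_rows, labels):
    """Average loss over a training set."""
    return float(np.mean([qcnn_loss(p, y) for p, y in zip(prob_rows, labels)]))

# Hypothetical output distributions for three training states.
prob_rows = np.array([
    [0.05, 0.35, 0.30, 0.30],
    [0.40, 0.10, 0.25, 0.25],
    [0.30, 0.30, 0.02, 0.38],
])
labels = ["00", "01", "11"]

predictions = [predict_phase(p) for p in prob_rows]
risk = empirical_risk(prob_rows, labels)
print(predictions)
print(risk)
```

In this toy data set the first two states are classified correctly and contribute a small loss, while the third is misclassified and contributes the largest term to the empirical risk.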

Supplementary Note 5. Application to unitary compiling
The second application of our generalization guarantees to be presented here is that of learning unitaries in the sense of (quantum-assisted) unitary compiling [38]. Unitary compiling is the task of finding a circuit representation of a target unitary, given black-box access to that unitary.
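From a learning perspective, this motivates the following problem: For each training example $(x_i, y_i)$, the input is a pure $n$-qubit state $\rho(x_i) = |\psi_i\rangle\langle\psi_i|$, and the corresponding label is the pure $n$-qubit state $|\phi_i\rangle\langle\phi_i| = U|\psi_i\rangle\langle\psi_i|U^\dagger$ obtained by unitarily evolving the input state according to the (unknown) target unitary $U$. We consider the loss function induced by the trace distance via
$$\ell(\alpha; |\psi\rangle, |\phi\rangle) = \frac{1}{4} \left\| |\phi\rangle\langle\phi| - \mathcal{U}^{\mathrm{QMLM}}_{\alpha}(|\psi\rangle\langle\psi|) \right\|_1^2,$$
where $\mathcal{U}^{\mathrm{QMLM}}_{\alpha}(\cdot) = U_\alpha (\cdot) U_\alpha^\dagger$ is a (unitary) quantum machine learning model, and we take $|\phi\rangle = U|\psi\rangle$. As we are considering a trace distance between pure states, we can rewrite the loss function in terms of the fidelity (i.e., the overlap) as $\ell(\alpha; |\psi\rangle, |\phi\rangle) = 1 - |\langle\phi|U_\alpha\psi\rangle|^2 = 1 - \mathrm{Tr}\left[|\phi\rangle\langle\phi| \cdot \mathcal{U}^{\mathrm{QMLM}}_{\alpha}(|\psi\rangle\langle\psi|)\right]$.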
Hence, we see that this loss function is encompassed by our scenario, because we can write
$$\ell(\alpha; |\psi\rangle, |\phi\rangle) = \mathrm{Tr}\left[O^{\mathrm{loss}}_{\psi,\phi} \, \mathcal{U}^{\mathrm{QMLM}}_{\alpha}(|\psi\rangle\langle\psi|)\right]$$
with loss observables $O^{\mathrm{loss}}_{\psi,\phi} = \mathbb{1} - |\phi\rangle\langle\phi|$ (depending only on the quantum output, but not on the input). With (the above rewriting of) this loss function, the expected loss, when the expectation is w.r.t. drawing the input states independently at random from the Haar measure, becomes connected to the Hilbert-Schmidt inner product between the target unitary and the unitary implemented by the circuit. This, in turn, can be given an operational interpretation, as detailed in [38].
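The equivalence of the three forms of the loss can be checked numerically. The sketch below is only an illustration under stated assumptions: the trace distance is taken with the usual factor $1/2$ and then squared, the unitaries are generic random unitaries obtained from a QR decomposition (not an actual trained circuit), and all helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_state(dim):
    """Normalized random complex vector, used as a pure state."""
    v = rng.normal(size=dim) + 1j * rng.normal(size=dim)
    return v / np.linalg.norm(v)

def random_unitary(dim):
    """A unitary from the QR decomposition of a complex Gaussian matrix."""
    z = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    q, _ = np.linalg.qr(z)
    return q

def projector(v):
    """Rank-one projector |v><v|."""
    return np.outer(v, v.conj())

dim = 4                          # two qubits, for illustration
U_target = random_unitary(dim)   # stands in for the unknown target U
U_alpha = random_unitary(dim)    # stands in for the model unitary U_alpha
psi = random_state(dim)
phi = U_target @ psi             # label state |phi> = U|psi>

out = projector(U_alpha @ psi)   # model output U_alpha |psi><psi| U_alpha^dagger

# (a) Squared trace distance, with trace distance = (1/2) * trace norm.
diff = projector(phi) - out
trace_norm = np.sum(np.abs(np.linalg.eigvalsh(diff)))
loss_trace = (0.5 * trace_norm) ** 2

# (b) One minus the fidelity between the two pure states.
loss_fid = 1.0 - abs(np.vdot(phi, U_alpha @ psi)) ** 2

# (c) Expectation value of the loss observable 1 - |phi><phi|.
O_loss = np.eye(dim) - projector(phi)
loss_obs = np.trace(O_loss @ out).real

print(loss_trace, loss_fid, loss_obs)
```

All three printed values agree up to floating-point error, which is the sense in which the trace-distance loss fits the loss-observable framework.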
We solve this learning problem using a QMLM with a variable structure. (See Sections II.C and IV of the main text for more details on how this is implemented.) In this scenario, Corollary 3 implies that, if we optimize over both (discrete) structures and (continuous) parameters and obtain an output structure $k^*$ with $T$ parameterized gates with a parameter setting $\alpha^* = (\theta^*, k^*)$, then, with probability $1 - \delta$ over the choice of training data of size $N$, which is drawn i.i.d. from some distribution $P$ over pure $n$-qubit states, we are guaranteed that assuming that the number of allowed structures with $T$ gates scales at most exponentially in $T$. Here, the $\tilde{O}$ hides terms logarithmic in $T$.
Consequently, we know that, with high probability, the trace distance between the state obtained by applying the learned unitary to a new, unseen input state (drawn at random from the data-generating distribution) and the true output state will be small if we can achieve a small average trace distance over the $N$ randomly sampled states, where $N$ scales roughly linearly in $T$. For many unitary gates of interest, namely those that can be efficiently implemented, we thus expect $T$, and thereby also $N$, to scale polynomially in $n$, the number of qubits. This is a substantial improvement over the training data sizes used in previous approaches to unitary compiling, which were often taken to be exponential in $n$ so as to uniquely determine the unknown target unitary [36,37,42]. This improvement comes at the cost of not compiling the target unitary exactly, but only with a certain (small) accuracy and success probability. Nevertheless, for many applications, paying this cost is worthwhile, given the significant savings in training data size guaranteed by our results.
As a concrete example, the QFT discussed in Section II.C of the main text can be exactly implemented with $T \in O(n^2)$ gates. In this case, our theory implies that $N \in O(n^2)$ training data points are, with high probability, sufficient for good generalization. As discussed in [58], approximate implementations of the QFT are possible with fewer gates, namely with $T \in O(n \log(n))$. Potentially, one can combine this insight with our result to obtain a similar improvement in the upper bound on the sufficient training data size.
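To give a feeling for the gap between quadratic and exponential scaling, the following small Python sketch compares a gate count for the exact QFT with $2^n$. It uses the textbook decomposition into $n$ Hadamard gates and $n(n-1)/2$ controlled-phase gates, ignoring the final swaps. This count is only a proxy for $T$ and is an assumption of this illustration, as is the function name; constants and logarithmic factors in the bound are not modeled.

```python
def qft_gate_count(n):
    """Gate count of the textbook exact QFT on n qubits (final swaps ignored)."""
    hadamards = n
    controlled_phases = n * (n - 1) // 2
    return hadamards + controlled_phases

# Compare the quadratic gate count with the exponential reference 2^n.
for n in (4, 8, 16, 32):
    print(n, qft_gate_count(n), 2 ** n)
```

Already for moderate $n$, a training set whose size tracks the gate count is many orders of magnitude smaller than one of size $2^n$.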